"""
# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import streamlit as st
from app import load_demo_image, device
from app.utils import load_model_cache
from lavis.processors import load_processor
from PIL import Image


def app():
    model_type = st.sidebar.selectbox("Model:", ["BLIP"])

    # ===== layout =====
    st.markdown(
        "<h1 style='text-align: center;'>Visual Question Answering</h1>",
        unsafe_allow_html=True,
    )

    instructions = """Try the provided image or upload your own:"""
    file = st.file_uploader(instructions)

    col1, col2 = st.columns(2)

    col1.header("Image")
    if file:
        raw_img = Image.open(file).convert("RGB")
    else:
        raw_img = load_demo_image()

    w, h = raw_img.size
    scaling_factor = 720 / w
    resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))

    col1.image(resized_image, use_column_width=True)

    col2.header("Question")
    user_question = col2.text_input("Input your question!", "What are objects there?")
    qa_button = st.button("Submit")

    col2.header("Answer")

    # ===== event =====
    vis_processor = load_processor("blip_image_eval").build(image_size=480)
    text_processor = load_processor("blip_question").build()

    if qa_button:
        if model_type.startswith("BLIP"):
            model = load_model_cache(
                "blip_vqa", model_type="vqav2", is_eval=True, device=device
            )

            img = vis_processor(raw_img).unsqueeze(0).to(device)
            question = text_processor(user_question)

            vqa_samples = {"image": img, "text_input": [question]}
            answers = model.predict_answers(vqa_samples, inference_method="generate")

            col2.write("\n".join(answers), use_column_width=True)
![Samples from the AVSD dataset (Image credit: "https://arxiv.org/pdf/1901.09107.pdf").](imgs/avsd_dialogue.png)
# Audio-Visual Scene-Aware Dialogues (AVSD)
## Description
[Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each grounded on a unique video. In the test split, six reference dialogue responses are provided for each test sample.
## Task
(from https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge)
In a **video-grounded dialogue task**, the system must generate responses to a user input in the context of a given dialog.
This context consists of a dialog history (previous utterances by both user and system) in addition to the video and audio information that make up the scene. The quality of a system's automatically generated responses is evaluated using objective measures to determine whether they are natural and informative.
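To make the inputs concrete, a single sample can be thought of as a grounding video plus the dialog history and the response to be generated. The structure below is purely illustrative; the field names are hypothetical and do not follow the official AVSD annotation schema.
```
# Purely illustrative structure of one video-grounded dialogue turn (hypothetical field names).
avsd_sample = {
    "video_id": "XYZ123",  # identifies the grounding video (and its audio track)
    "caption": "a man walks into the kitchen and opens the fridge",
    "dialog_history": [
        {"question": "how many people are in the video?", "answer": "just one man"},
        {"question": "what is he doing?", "answer": "he walks over to the fridge"},
    ],
    "question": "does he take anything out of the fridge?",
    "answer": "yes, he takes out a bottle of water",  # target response to generate
}
```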
## Metrics
Models are typically evaluated according to the [BLEU](https://aclanthology.org/P02-1040/), [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf), [METEOR](https://aclanthology.org/W05-0909/), and [ROUGE-L](https://aclanthology.org/W04-1013/) metrics.
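A minimal sketch of computing such caption-style metrics, assuming the `pycocoevalcap` package is installed and using toy, pre-tokenized sentences (METEOR and ROUGE-L are computed analogously with their respective scorers):
```
# Toy example with pycocoevalcap; both scorers expect dicts mapping an id to a list of
# sentences: references may hold several sentences per id, hypotheses exactly one.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

refs = {"0": ["a man is watching television in the living room"]}
hyps = {"0": ["a man watches tv in the living room"]}

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)  # BLEU-1 .. BLEU-4
cider_score, _ = Cider().compute_score(refs, hyps)

print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)
```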
## Leaderboard
TBD
## Auto-Downloading
Please refer to the [benchmark website](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instructions on downloading the dataset.
## References
"Audio Visual Scene-Aware Dialog", Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
![Samples from the COCO Caption dataset (Image credit: "https://arxiv.org/pdf/1504.00325.pdf").](imgs/coco_caption.png)
# Microsoft COCO Dataset (Captioning)
## Description
[Microsoft COCO Captions dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
## Task
(from https://paperswithcode.com/task/image-captioning)
**Image captioning** is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
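For a concrete starting point, the sketch below generates a caption with LAVIS' `load_model_and_preprocess` helper; `demo.jpg` is a placeholder image path and the printed caption is only an example of the output format:
```
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_image = Image.open("demo.jpg").convert("RGB")  # placeholder path

# BLIP captioner fine-tuned on COCO; pretrained weights are fetched on first use.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # e.g. ["a dog sitting on a bench"]
```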
## Metrics
Models are typically evaluated according to the [BLEU](https://aclanthology.org/P02-1040/) or [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) metric.
## Leaderboard
(Ranked by BLEU-4)
| Rank | Model | BLEU-4 | CIDEr | METEOR | SPICE | Resources |
| ---- | :-----: | :----: | :---: | :----: | :---: | :----------------------------------------------------------------------------------------------------------------------------------------------: |
| 1 | OFA | 44.9 | 154.9 | 32.5 | 26.6 | [paper](https://arxiv.org/abs/2202.03052), [code](https://github.com/OFA-Sys/OFA) |
| 2 | LEMON | 42.6 | 145.5 | 31.4 | 25.5 | [paper]() |
| 3 | CoCa | 40.9 | 143.6 | 33.9 | 24.7 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 4 | SimVLM | 40.6 | 143.3 | 33.7 | 25.4 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 5 | VinVL | 41.0 | 140.9 | 31.1 | 25.2 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
| 6 | OSCAR | 40.7 | 140.0 | 30.6 | 24.5 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
| 7 | BLIP | 40.4 | 136.7 | 31.4 | 24.3 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
| 8 | M^2 | 39.1 | 131.2 | 29.2 | 22.6 | [paper](https://arxiv.org/pdf/1912.08226v2.pdf), [code](https://github.com/aimagelab/meshed-memory-transformer) |
| 9 | BUTD | 36.5 | 113.5 | 27.0 | 20.3 | [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention) |
| 10 | ClipCap | 32.2 | 108.4 | 27.1 | 20.1 | [paper](https://arxiv.org/pdf/2111.09734v1.pdf), [code](https://github.com/rmokady/clip_prefix_caption) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_coco.py
```
## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
![Samples from the COCO Caption dataset (Image credit: "https://arxiv.org/pdf/1504.00325.pdf").](imgs/coco_caption.png)
# Microsoft COCO Dataset (Retrieval)
## Description
[Microsoft COCO dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
## Task
Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
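One way to obtain the query-gallery similarity scores that the ranking relies on is to compare projected image and text embeddings from a pretrained model. The sketch below uses LAVIS' BLIP feature extractor; the attribute names (`image_embeds_proj`, `text_embeds_proj`) reflect our reading of that interface and should be treated as assumptions, and `demo.jpg` is a placeholder image path.
```
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_image = Image.open("demo.jpg").convert("RGB")  # placeholder query image

model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
texts = [txt_processors["eval"](t) for t in ["a cat on a sofa", "a busy street"]]

sample = {"image": image, "text_input": texts}
image_feat = model.extract_features(sample, mode="image")
text_feat = model.extract_features(sample, mode="text")

# Projected (assumed normalized) embeddings; index 0 is the [CLS] token.
sim = image_feat.image_embeds_proj[:, 0, :] @ text_feat.text_embeds_proj[:, 0, :].t()
print(sim)  # 1 x 2 scores used to rank the text gallery for this image query
```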
## Metrics
Common metrics are recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) within the top-k retrieved results.
We use TR to denote the image-to-text retrieval recall score and IR to denote the text-to-image retrieval recall score.
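Concretely, a query counts as a hit at recall@k if any of its top-k ranked gallery items is a ground-truth match. A minimal NumPy sketch, under the simplifying assumption that query `i` has exactly one correct gallery item at index `i` (the actual COCO evaluation tracks all five captions per image):
```
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    # sim[i, j] = similarity of query i and gallery item j; the ground-truth
    # match of query i is assumed to sit at gallery index i.
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar gallery items
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.random.rand(100, 100)  # toy similarity scores
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```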
## Leaderboard
(Ranked by TR@1.)
| Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
| 2 | X-VLM | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
| 3 | ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
| 4 | ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | [paper](https://arxiv.org/abs/2102.05918) |
| 5 | VinVL | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
| 6 | OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
| 7 | UNITER | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_coco.py
```
## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
![Samples from the Conceptual Captions dataset (Image credit: "https://ai.google.com/research/ConceptualCaptions/download").](imgs/conceptual_captions.png)
# Conceptual Captions Dataset
## Description
(from https://huggingface.co/datasets/conceptual_captions)
Conceptual Captions 3M (CC3M) is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.
Conceptual Captions 12M (CC12M) is a dataset of 12 million image-text pairs specifically meant for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).
## Task
Image-language pre-training; image captioning.
## Auto-Downloading
**Warning**: images in this dataset are downloaded by requesting their URLs. Since URLs may disappear over time, the downloaded dataset is expected to be partial.
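To illustrate the best-effort nature of the download, a hypothetical helper like the one below would skip dead or slow URLs instead of failing; it is not the repository's actual download script (which lives under `DownloadConceptualCaptions`).
```
import requests

def try_download(url: str, path: str, timeout: float = 10.0) -> bool:
    # Best-effort fetch of a single image URL; returns False when the link is dead.
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return False  # URL disappeared or timed out: skip this pair
    with open(path, "wb") as f:
        f.write(resp.content)
    return True
```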
### Conceptual Captions 3M
- Download images
```
cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc3m.py
```
- Create annotations by running the notebook
```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_3m.ipynb```
### Conceptual Captions 12M
- Download images
```
cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc12m.py
```
- Create annotations by running the notebook
```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_12m.ipynb```
## References
Edwin G. Ng, Bo Pang, Piyush Sharma and Radu Soricut. 2020. Understanding Guided Image Captioning Performance Across Domains. arXiv preprint arXiv:2012.02339.
![Samples from the DiDeMo dataset (Image credit: "https://www.di.ens.fr/~miech/datasetviz/").](imgs/didemo.png)
# DiDeMo Dataset (Retrieval)
## Description
[DiDeMo (Distinct Describable Moments)](https://github.com/LisaAnne/LocalizingMoments) contains over 10,000 unedited personal videos, each paired with natural-language descriptions that are localized in time, for roughly 40,000 video-text annotations in total.
## Task
Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.
## Metrics
Common metrics are recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) within the top-k retrieved results.
We use TR to denote the video-to-text retrieval recall score and IR to denote the text-to-video retrieval recall score.
## Leaderboard
TBD
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_didemo.py
```
## References
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. "Localizing Moments in Video with Natural Language." In Proceedings of the IEEE International Conference on Computer Vision, pp. 5803-5812. 2017.
![Samples from the Flickr30k dataset (Image credit: "https://bryanplummer.com/Flickr30kEntities/").](imgs/flickr30k.png)
# Flickr30K Dataset (Retrieval)
## Description
[Flickr30k](https://bryanplummer.com/Flickr30kEntities/) dataset contains 31k+ images collected from Flickr, together with 5 reference sentences provided by human annotators for each image.
## Task
Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
## Metrics
Common metrics are recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) within the top-k retrieved results.
We use TR to denote the image-to-text retrieval recall score and IR to denote the text-to-image retrieval recall score.
## Leaderboard
(Ranked by TR@1.)
| Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1 | BLIP | 97.2 | 99.9 | 100.0 | 87.5 | 97.7 | 98.9 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
| 2 | X-VLM | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
| 3 | ALBEF | 95.9 | 99.8 | 100.0 | 85.6 | 97.5 | 98.9 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
| 4 | ALIGN | 95.3 | 99.8 | 100.0 | 84.9 | 97.4 | 98.6 | [paper](https://arxiv.org/abs/2102.05918) |
| 5 | VILLA | 87.9 | 97.5 | 98.8 | 76.3 | 94.2 | 96.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
| 6 | UNITER | 87.3 | 98.0 | 99.2 | 75.6 | 94.1 | 96.8 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_flickr.py
```
## References
Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. "Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models." IJCV, 123(1):74-93, 2017.
![From https://arxiv.org/abs/1902.09506.](imgs/gqa.png)
# GQA Dataset
## Description
(from https://cs.stanford.edu/people/dorarad/gqa/about.html)
GQA is a VQA dataset of real-world images that requires visual, spatial, and compositional reasoning.
It consists of 22M questions and 110K images.
## Task
(from https://arxiv.org/abs/1902.09506)
Given an image and a question, the model is required to output a correct answer.
GQA questions require spatial understanding, multiple reasoning skills and multiple-step inference.
## Metrics
The metrics are accuracy, consistency, validity, and plausibility; the most commonly reported metric is accuracy.
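For reference, the commonly reported accuracy is plain exact-match between predicted and ground-truth answers; the sketch below assumes hypothetical dicts keyed by question id (the official evaluation script additionally computes consistency, validity, and plausibility):
```
def gqa_accuracy(predictions: dict, answers: dict) -> float:
    # predictions / answers: question_id -> answer string (illustrative format).
    correct = sum(
        predictions[qid].strip().lower() == ans.strip().lower()
        for qid, ans in answers.items()
    )
    return correct / len(answers)

print(gqa_accuracy({"q1": "red"}, {"q1": "Red"}))  # 1.0
```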
## Leaderboard
TBD
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_gqa.py
```
## References
"GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering", Drew A. Hudson, Christopher D. Manning