## From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
This is the official code for the Img2LLM-VQA paper. The implementation is integrated into LAVIS.
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks. However, effectively applying LLMs to zero-shot visual question answering (VQA) remains challenging, primarily because of the modality disconnect and task disconnect between LLMs and the VQA task. We propose **Img2LLM**, a plug-and-play module that generates prompts bridging these disconnects, so that LLMs can perform VQA tasks without end-to-end training.
See the paper for illustrations and details of the answer extraction, question generation, and caption prompt procedures in Img2LLM.
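To make the prompt construction concrete, here is a minimal sketch of how generated captions and synthetic question-answer pairs might be assembled into a single textual prompt for a frozen LLM. The function name and the exact template are illustrative assumptions, not the paper's verbatim format:

```python
# Illustrative sketch only: the template below approximates, but is not
# guaranteed to match, the exact prompt format used in the paper.
def build_vqa_prompt(captions, qa_pairs, question):
    """captions: list[str]; qa_pairs: list of (question, answer); question: str."""
    context = " ".join(captions)                        # caption prompt
    exemplars = "".join(f"Question: {q} Answer: {a}\n"  # synthetic QA exemplars
                        for q, a in qa_pairs)
    return f"Contexts: {context}\n{exemplars}Question: {question} Answer:"

prompt = build_vqa_prompt(
    captions=["a man riding a red bicycle down a street"],
    qa_pairs=[("What is the man riding?", "bicycle")],
    question="What color is the bicycle?",
)
```

The synthetic QA exemplars serve as in-context demonstrations of the VQA task, while the captions supply the image content in textual form, which is what lets the frozen LLM answer without any vision input.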
### Demo
We include an interactive demo [Colab notebook](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb)
that walks through the Img2LLM-VQA inference workflow (a minimal code sketch follows the list):
1. Image-question matching: compute relevancy scores of the image patches with respect to the question, and discard generated noisy captions with low relevancy scores.
2. Image captioning: generate question-guided captions based on the relevancy scores.
3. Question generation: generate questions based on the synthetic answers and captions.
4. Large language model: a frozen pre-trained LLM, e.g., OPT or GPT-3, answers the question from the constructed prompt.
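For reference, a minimal sketch of this workflow using the LAVIS `load_model_and_preprocess` API is shown below. The registered model name (`img2prompt_vqa`) and the `forward_*` method names are taken from the demo notebook and may differ across LAVIS versions; the image path and question are placeholders:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model and method names follow the demo notebook; verify against your LAVIS version.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="img2prompt_vqa", model_type="base", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
question = "What is in the picture?"
samples = {
    "image": vis_processors["eval"](raw_image).unsqueeze(0).to(device),
    "text_input": [txt_processors["eval"](question)],
}

samples = model.forward_itm(samples=samples)              # 1. image-question matching
samples = model.forward_cap(samples=samples,              # 2. question-guided captioning
                            num_captions=50, num_patches=20)
samples = model.forward_qa_generation(samples)            # 3. synthetic QA generation
# The notebook then assembles the captions and QA pairs into a prompt
# and queries a frozen LLM (e.g., OPT) with it.
```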
### Zero-Shot Evaluation
| Model | End-to-End Training? | VQAv2 val | VQAv2 test | OK-VQA test | AOK-VQA val | AOK-VQA test |
|---|---|---|---|---|---|---|
| Frozen-7B | ✓ | 29.5 | - | 5.9 | - | - |
| Flamingo-9B | ✓ | - | 51.8 | 44.7 | - | - |
| Flamingo-80B | ✓ | - | 56.3 | 50.6 | - | - |
| Img2LLM-VQA-OPT13B | ✗ | 57.1 | 57.3 | 39.9 | 33.3 | 33.0 |
| Img2LLM-VQA-OPT30B | ✗ | 59.5 | 60.4 | 41.8 | 36.9 | 36.0 |
| Img2LLM-VQA-OPT66B | ✗ | 59.9 | 60.3 | 43.2 | 38.7 | 38.2 |
| Img2LLM-VQA-OPT175B | ✗ | 60.6 | 61.9 | 45.6 | 42.9 | 40.7 |