<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# BLIP

## Overview

The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.

BLIP is a model that can perform a variety of multi-modal tasks, including:
- Visual Question Answering
- Image-Text Retrieval (image-text matching)
- Image Captioning

The abstract from the paper is the following:

*Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. 
However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.*

![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif)

This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
The original code can be found [here](https://github.com/salesforce/BLIP).

## Resources

- [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset
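
The snippet below is a minimal inference sketch for image captioning with the PyTorch classes documented further down on this page. The checkpoint name (`Salesforce/blip-image-captioning-base`) and the example image URL are assumptions; any BLIP captioning checkpoint should work the same way.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor (image processor + tokenizer) and the captioning model.
# The checkpoint name is an assumption; substitute your own if needed.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Unconditional captioning: only the image is passed to the processor.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Conditional captioning: an optional text prompt steers the generated caption.
inputs = processor(images=image, text="a photography of", return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```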

## BlipConfig

[[autodoc]] BlipConfig
    - from_text_vision_configs
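
As a quick illustration of `from_text_vision_configs`, the sketch below builds a `BlipConfig` from text and vision sub-configurations; the default values used here are only placeholders.

```python
from transformers import BlipConfig, BlipTextConfig, BlipVisionConfig

# Instantiate the sub-configurations (defaults here; set custom values as needed).
text_config = BlipTextConfig()
vision_config = BlipVisionConfig()

# Combine them into a full BLIP configuration.
config = BlipConfig.from_text_vision_configs(text_config, vision_config)
```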

## BlipTextConfig

[[autodoc]] BlipTextConfig

## BlipVisionConfig

[[autodoc]] BlipVisionConfig

## BlipProcessor

[[autodoc]] BlipProcessor

## BlipImageProcessor

[[autodoc]] BlipImageProcessor
    - preprocess

<frameworkcontent>
<pt>

## BlipModel

[[autodoc]] BlipModel
    - forward
    - get_text_features
    - get_image_features
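
As a rough sketch of the feature-extraction helpers listed above, the example below pulls image and text embeddings from a base checkpoint; the checkpoint name and the keyword arguments follow the usual Transformers conventions and should be treated as assumptions.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BlipModel

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipModel.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text="two cats sleeping on a couch", return_tensors="pt")

with torch.no_grad():
    # Pooled image and text embeddings, respectively.
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
```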

## BlipTextModel

[[autodoc]] BlipTextModel
    - forward

## BlipVisionModel

[[autodoc]] BlipVisionModel
    - forward

## BlipForConditionalGeneration

[[autodoc]] BlipForConditionalGeneration
    - forward

## BlipForImageTextRetrieval

[[autodoc]] BlipForImageTextRetrieval
    - forward
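
A minimal image-text matching sketch, assuming the `Salesforce/blip-itm-base-coco` checkpoint; the `itm_score` attribute and the ordering of the two matching classes follow the released implementation and are assumptions on my part.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForImageTextRetrieval

processor = AutoProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "an image of two cats"

inputs = processor(images=image, text=text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# itm_score holds the image-text matching logits; softmax over the last
# dimension gives the probability that the image and text match (index 1).
match_probability = torch.softmax(outputs.itm_score, dim=1)[:, 1]
print(match_probability)
```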

## BlipForQuestionAnswering

[[autodoc]] BlipForQuestionAnswering
    - forward
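
A short visual question answering sketch, assuming the `Salesforce/blip-vqa-base` checkpoint; the question is passed as the text input and the answer is generated auto-regressively.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, BlipForQuestionAnswering

processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are in the picture?"

# The question is tokenized together with the image features.
inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```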

</pt>
<tf>

## TFBlipModel

[[autodoc]] TFBlipModel
    - call
    - get_text_features
    - get_image_features

## TFBlipTextModel

[[autodoc]] TFBlipTextModel
    - call

## TFBlipVisionModel

[[autodoc]] TFBlipVisionModel
    - call

## TFBlipForConditionalGeneration

[[autodoc]] TFBlipForConditionalGeneration
    - call
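
The same captioning flow works with the TensorFlow classes; the sketch below mirrors the PyTorch example above and assumes the same `Salesforce/blip-image-captioning-base` checkpoint.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, TFBlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
# If the checkpoint only ships PyTorch weights, pass from_pt=True here.
model = TFBlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Conditional captioning with a short text prompt; note return_tensors="tf".
inputs = processor(images=image, text="a photography of", return_tensors="tf")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```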

## TFBlipForImageTextRetrieval

[[autodoc]] TFBlipForImageTextRetrieval
    - call

## TFBlipForQuestionAnswering

[[autodoc]] TFBlipForQuestionAnswering
    - call
</tf>
</frameworkcontent>