**[2024.3.18] We have released our code and model.**
**[2024.6.7] We have released our paper. [Arxiv Link](https://arxiv.org/abs/2406.04292)**
**[2024.6.13] We have released [VISTA-S2 dataset](https://huggingface.co/datasets/JUNJIE99/VISTA_S2), a hybrid multi-modal dataset consisting of over 500,000 instances for multi-modal training (Stage-2 training in our paper).**
## Introduction
In this project, we introduce Visualized-BGE, a universal multi-modal embedding model. By incorporating image token embedding into the BGE Text Embedding framework, Visualized-BGE gains the flexibility to process multi-modal data that goes beyond just text. Visualized-BGE is mainly used for hybrid modal retrieval tasks, including but not limited to:
- Multi-Modal Knowledge Retrieval (query: text; candidate: image-text pairs, text, or image) e.g. [WebQA](https://github.com/WebQnA/WebQA)
We have generated a hybrid multi-modal dataset consisting of over 500,000 instances for multi-modal training (Stage-2 training in our paper). You can download our dataset from this [🤗 HF Link](https://huggingface.co/datasets/JUNJIE99/VISTA_S2).
Reassemble and extract the image archive with the following commands:
```bash
cat images.tar.part* > images.tar
tar -xvf images.tar
```
After extraction, you should obtain the image directories referenced by the annotation files (JSON), which you can then use for your own training.
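As a quick sanity check, the annotation files can be inspected before wiring up a training loop. This is a minimal sketch; the file name below is a placeholder, and the actual schema should be read from the JSON files shipped with the dataset:

```python
import json

# Placeholder file name: substitute one of the JSON annotation files
# from the VISTA_S2 download.
with open("annotations.json") as f:
    records = json.load(f)

print(len(records))
for rec in records[:3]:
    print(rec)  # inspect the real keys before building a dataloader
```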
You don't need to install `xformers` or `apex`; they are not essential for inference and can often cause installation issues.
### Generate Embeddings for Multi-Modal Data
Visualized-BGE provides the versatility to encode multi-modal data in a variety of formats, whether it's purely text, solely image-based, or a combination of both.
> **Note:** Please download the model weight file ([bge-visualized-base-en-v1.5](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth?download=true), [bge-visualized-m3](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_m3.pth?download=true)) in advance and pass the path to the `model_weight` parameter.
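The snippet below is a minimal sketch of loading the model and encoding each input format. The image paths are placeholders, and the import path follows the FlagEmbedding repository at the time of release (it may differ across versions):

```python
import torch
from FlagEmbedding.visual.modeling import Visualized_BGE

# Combine a BGE text backbone from Hugging Face with the downloaded
# Visualized-BGE weight file (pass its local path as model_weight).
model = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5",
                       model_weight="Visualized_base_en_v1.5.pth")
model.eval()

with torch.no_grad():
    text_emb = model.encode(text="A suspension bridge over the Hudson River")  # text only
    img_emb = model.encode(image="./example.png")                              # image only (placeholder path)
    mm_emb = model.encode(image="./example.png",
                          text="A suspension bridge")                          # image + text
```

For the multilingual weight, the backbone would be `BAAI/bge-m3` paired with `Visualized_m3.pth` instead.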
- Composed Image Retrieval
```python
# Use Visualized-BGE for composed image retrieval; candidates may mix modalities.
# Below, a text-only candidate is encoded (model setup shown above).
candi_emb_3 = model.encode(text="The Mid-Hudson Bridge was designated as a New York State Historic Civil Engineering Landmark by the American Society of Civil Engineers in 1983. The bridge was renamed the \"Franklin Delano Roosevelt Mid-Hudson Bridge\" in 1994.")
```
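Continuing the sketch above, a composed query (a reference image plus a modification text) can be scored against candidates of any modality by inner product. The image paths and query text here are illustrative:

```python
with torch.no_grad():
    # Composed query: a reference image plus an instruction describing the change.
    query_emb = model.encode(image="./cir_query.png",
                             text="Make the background dark, as if taken at night")
    candi_emb_1 = model.encode(image="./cir_candi_1.png")  # image-only candidate
    candi_emb_2 = model.encode(image="./cir_candi_2.png")  # image-only candidate

# Rank candidates (including the text-only candi_emb_3 above) by similarity.
sim_1 = query_emb @ candi_emb_1.T
sim_2 = query_emb @ candi_emb_2.T
sim_3 = query_emb @ candi_emb_3.T
print(sim_1, sim_2, sim_3)
```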
## Performance
Visualized BGE delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.
#### Zero-shot Performance
- Statistics of the zero-shot multi-modal retrieval benchmark datasets. For zero-shot evaluation, we use the queries from each dataset's validation or test set to retrieve over that dataset's entire corpus.

- Zero-shot evaluation results with Recall@5 on various hybrid multi-modal retrieval benchmarks. The -MM notation indicates baseline models that have undergone multi-modal training on our generated data.
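For reference, Recall@5 counts a query as a hit when at least one relevant candidate appears among its five highest-scoring retrievals. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, relevant: list[set], k: int = 5) -> float:
    """scores: (num_queries, num_candidates) similarity matrix;
    relevant: per-query sets of relevant candidate indices."""
    topk = np.argsort(-scores, axis=1)[:, :k]  # indices of the k best candidates
    hits = [bool(set(row) & rel) for row, rel in zip(topk, relevant)]
    return float(np.mean(hits))

# Toy usage: 2 queries, 4 candidates.
scores = np.array([[0.9, 0.1, 0.3, 0.2],
                   [0.2, 0.8, 0.1, 0.4]])
print(recall_at_k(scores, [{0}, {2}], k=2))  # query 1 hits, query 2 misses -> 0.5
```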

#### Fine-tuning on Downstream Tasks
- Supervised fine-tuning performance on the WebQA dataset. All retrievals are performed on the entire deduplicated corpus.

- Supervised fine-tuning performance on the CIRR test set.

- Supervised fine-tuning performance on the ReMuQ test set.

## FAQ
**Q1: Can Visualized BGE be used for cross-modal retrieval (text to image)?**
A1: While it is technically possible, this is not the recommended use case. Our model focuses on augmenting hybrid modal retrieval tasks with visual capabilities.
## Acknowledgement
The image token embedding model in this project is built upon the foundations laid by [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP).
## Citation
If you find this repository useful, please consider giving it a star ⭐ and a citation:
```
@article{zhou2024vista,
  title={VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval},
  author={Zhou, Junjie and Liu, Zheng and Xiao, Shitao and Zhao, Bo and Xiong, Yongping},
  journal={arXiv preprint arXiv:2406.04292},
  year={2024}
}
```