# BLIP3-o
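BLIP3-o is a family of fully open-source unified multimodal models that supports text-to-image generation, image captioning, visual question answering, and other tasks. It is enhanced with the 60k instruction-tuning dataset BLIP3o-60k and can be used to pre-annotate multimodal data.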

## Paper
`BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset`
- https://arxiv.org/pdf/2505.09568

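## Model Architecture
The image generation module is built directly on top of Qwen 2.5 VL. In the 8B model, the Qwen2.5-VL-7B-Instruct backbone is frozen and only the diffusion transformer is trained, for a total of 1.4 billion (1.4B) trainable parameters. BLIP3-o combines CLIP features, flow matching, and sequential training to build a state-of-the-art unified multimodal model.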
<div align=center>
    <img src="./doc/blip3-o.png"/>
</div>

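## Algorithm
BLIP3-o combines CLIP embeddings with a flow matching loss. CLIP features yield a more compact and semantically richer representation than VAE features, which improves training efficiency. Flow matching proves to be a more effective training objective for modeling the image distribution, producing greater sample diversity and higher visual quality.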
<div align=center>
    <img src="./doc/algorithm.png"/>
</div>
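
To make the flow matching objective more concrete, here is a minimal pure-Python sketch of the loss for a single sample. It assumes a straight-line path between Gaussian noise and the target feature (a common choice, not confirmed by this README as the exact formulation BLIP3-o uses). The names `flow_matching_loss` and `predict_velocity` are hypothetical, and the plain list `x1` merely stands in for a CLIP image embedding.

```python
import random


def flow_matching_loss(x1, predict_velocity, rng=random):
    """One-sample flow matching loss on a straight-line path (illustrative only).

    x1: target feature vector (stands in for a CLIP image embedding).
    predict_velocity: callable (x_t, t) -> predicted velocity vector.
    rng: source of randomness exposing gauss() and random().
    """
    # Sample the noise endpoint x0 ~ N(0, I) and a time t ~ U[0, 1).
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    # Point on the straight path from noise (t=0) to data (t=1).
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    # On a straight path the target velocity is constant: x1 - x0.
    target = [b - a for a, b in zip(x0, x1)]
    pred = predict_velocity(x_t, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)
```

In an actual training loop the predictor would be the diffusion transformer conditioned on the frozen backbone's outputs, and the loss would be averaged over a batch; this sketch only shows the shape of the objective.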

## Environment Setup
```
mv BLIP3o_pytorch BLIP3o # drop the framework-name suffix
```

### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is 6063b673703a
docker run -it --shm-size=64G -v $PWD/BLIP3o:/home/BLIP3o -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name blip3o <your IMAGE ID> bash

cd /home/BLIP3o
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip install whl/bitsandbytes-0.42.0+das.opt1.dtk2504-py3-none-any.whl # bitsandbytes==0.42
pip install whl/torchaudio-2.1.2+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.1.2

cd diffusers 
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # diffusers==0.32.2

cd /home/BLIP3o
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # blip3o==0.1.0
```
### Dockerfile (Option 2)
```
cd /home/BLIP3o/docker
docker build --no-cache -t blip3o:latest .
docker run --shm-size=64G --name blip3o -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../BLIP3o:/home/BLIP3o -it blip3o bash
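# If building the environment through the Dockerfile takes too long, comment out the pip install steps in it and install the Python libraries after the container starts: pip install -r requirements.txt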

cd /home/BLIP3o
pip install whl/bitsandbytes-0.42.0+das.opt1.dtk2504-py3-none-any.whl # bitsandbytes==0.42
pip install whl/torchaudio-2.1.2+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.1.2

cd diffusers 
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # diffusers==0.32.2

cd /home/BLIP3o
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # blip3o==0.1.0
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries this project requires can be downloaded from the Guanghe (光合) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver:dtk2504
python:python3.10
torch:2.4.1
torchvision:0.19.1
torchaudio:2.1.2
triton:3.0.0
vllm:0.6.2
flash-attn:2.6.1
deepspeed:0.14.2
apex:1.4.0
bitsandbytes:0.42
transformers:4.51.3
```

`Tip: the versions of the DCU-related tools above (DTK driver, Python, torch, etc.) must match each other exactly.`

2. Install the remaining general-purpose libraries from requirements.txt:
```
cd /home/BLIP3o
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip install whl/bitsandbytes-0.42.0+das.opt1.dtk2504-py3-none-any.whl # bitsandbytes==0.42
pip install whl/torchaudio-2.1.2+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.1.2

cd diffusers 
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # diffusers==0.32.2

cd /home/BLIP3o
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # blip3o==0.1.0
```
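
Because the DCU toolchain versions have to line up exactly, it can help to check the environment before running anything. The following is a minimal sketch using only the standard library; the distribution names in `EXPECTED` are assumed to match the names the wheels install under, and the helper names (`installed_version`, `matches`, `check_versions`) are hypothetical. DCU wheels carry a local suffix such as `+das.opt1.dtk2504`, so only the release numbers are compared.

```python
from importlib import metadata

# Versions pinned in the Anaconda list above; keys are assumed distribution names.
EXPECTED = {
    "torch": "2.4.1",
    "torchvision": "0.19.1",
    "torchaudio": "2.1.2",
    "triton": "3.0.0",
    "vllm": "0.6.2",
    "flash-attn": "2.6.1",
    "deepspeed": "0.14.2",
    "apex": "1.4.0",
    "bitsandbytes": "0.42",
    "transformers": "4.51.3",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, expected):
    """Compare release numbers, ignoring a local suffix such as '+das.opt1.dtk2504'."""
    if installed is None:
        return False
    release = installed.split("+")[0].split(".")
    wanted = expected.split(".")
    return release[:len(wanted)] == wanted


def check_versions(expected=EXPECTED, lookup=installed_version):
    """Return {name: (installed, expected)} for every package that does not match."""
    return {
        name: (lookup(name), want)
        for name, want in expected.items()
        if not matches(lookup(name), want)
    }
```

Calling `check_versions()` inside the container returns an empty dict when every pinned package is present at the expected release; any entry it does return names the package together with the installed and expected versions.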

## Dataset
`None`

## Training
None

## Inference
Directory layout of the pretrained weights:
```
/home/BLIP3o/
    |── BLIP3o-Model-8B
    |── Alpha-VLLM/Lumina-Next-SFT-diffusers
    |── black-forest-labs/FLUX.1-dev
    |── jiuhai/eva_clip_vision_tower
    |── Qwen/Qwen2.5-VL-7B-Instruct
    └── Qwen/Qwen2.5-VL-3B-Instruct
``` 
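
A missing or half-downloaded weight directory is an easy mistake to make before inference. As a hedged illustration, the sketch below checks the layout shown above with `pathlib`; the function name `missing_weight_dirs` is hypothetical, and treating an empty directory as missing is an assumption of this sketch rather than something the project enforces.

```python
from pathlib import Path

# Sub-directories expected under the project root, taken from the tree above.
WEIGHT_DIRS = [
    "BLIP3o-Model-8B",
    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    "black-forest-labs/FLUX.1-dev",
    "jiuhai/eva_clip_vision_tower",
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "Qwen/Qwen2.5-VL-3B-Instruct",
]


def missing_weight_dirs(root="/home/BLIP3o", expected=WEIGHT_DIRS):
    """Return the expected weight directories that are absent or empty under root."""
    missing = []
    for rel in expected:
        path = Path(root) / rel
        # An empty directory is treated as missing (assumption of this sketch).
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing
```

Running `missing_weight_dirs()` before `inference.py` gives a quick list of what still needs to be downloaded.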

### Single Node, Single Card
```
cd /home/BLIP3o
python inference.py BLIP3o-Model-8B # the paper authors' source code is limited to single-card inference
```
For more details, see [`README_origin`](./README_origin.md) from the original project.

## Result
`Input:`
```
prompt = "A photo of cute cat"

```

`Output:`
```
'A photo of cute cat.png'
```
<div align=center>
    <img src="./doc/A photo of cute cat.png"/>
</div>

Official sample generations:
<div align=center>
    <img src="./doc/result.png"/>
</div>

### Accuracy
Accuracy on DCU matches that on GPU. Inference framework: PyTorch.

## Application Scenarios
### Algorithm Category
`Multimodal`
### Key Application Industries
`Manufacturing, broadcast media, finance, energy, healthcare, home furnishings, education`
## Pretrained Weights
Hugging Face download links: [BLIP3o-Model-8B](https://huggingface.co/BLIP3o/BLIP3o-Model-8B), [Alpha-VLLM/Lumina-Next-SFT-diffusers](https://huggingface.co/Alpha-VLLM/Lumina-Next-SFT-diffusers), [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev), [jiuhai/eva_clip_vision_tower](https://huggingface.co/jiuhai/eva_clip_vision_tower), [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
## Source Repository and Issue Reporting
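
One possible way to fetch all six repositories into the directory layout used for inference is sketched below. It assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function; the `REPOS` mapping and the `download_all` helper are hypothetical names introduced for illustration. Some repositories may require accepting a license and logging in to Hugging Face first.

```python
from pathlib import Path

# Hugging Face repo id -> directory (relative to the project root) used at inference.
REPOS = {
    "BLIP3o/BLIP3o-Model-8B": "BLIP3o-Model-8B",
    "Alpha-VLLM/Lumina-Next-SFT-diffusers": "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    "black-forest-labs/FLUX.1-dev": "black-forest-labs/FLUX.1-dev",
    "jiuhai/eva_clip_vision_tower": "jiuhai/eva_clip_vision_tower",
    "Qwen/Qwen2.5-VL-7B-Instruct": "Qwen/Qwen2.5-VL-7B-Instruct",
    "Qwen/Qwen2.5-VL-3B-Instruct": "Qwen/Qwen2.5-VL-3B-Instruct",
}


def download_all(root="/home/BLIP3o", repos=REPOS, fetch=None):
    """Download each repository into root; `fetch(repo_id, local_dir)` does the work."""
    if fetch is None:
        # Imported lazily so the mapping can be inspected without huggingface_hub.
        from huggingface_hub import snapshot_download

        def fetch(repo_id, local_dir):
            snapshot_download(repo_id=repo_id, local_dir=local_dir)

    for repo_id, rel in repos.items():
        fetch(repo_id, str(Path(root) / rel))
```

Calling `download_all()` with the defaults places the weights under `/home/BLIP3o`; passing a custom `fetch` callable makes it possible to dry-run the plan or swap in a mirror.
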
- http://developer.sourcefind.cn/codes/modelzoo/BLIP3o_pytorch.git
## References
- https://github.com/JiuhaiChen/BLIP3o.git