
# BLIP-3

## Paper

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

https://arxiv.org/pdf/2408.08872

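## Model structure

BLIP-3, also known as xGen-MM, is a framework for developing large multimodal models (LMMs). The framework comprises carefully curated datasets, a training recipe, model architectures, and the resulting suite of LMMs. xGen-MM is short for xGen-MultiModal and extends the Salesforce xGen initiative on foundation AI models. The models are rigorously evaluated on a range of tasks, including single-image and multi-image benchmarks. The pretrained base model shows strong in-context learning ability, and the instruction-tuned model is competitive among open-source LMMs of similar size. In addition, a safety-tuned model trained with DPO is introduced to mitigate harmful behaviors such as hallucination and to improve safety.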

## Environment setup

### Docker (Option 1)

```
# Pull the Docker image from SourceFind (光源)
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04.1-py3.10
# Create and start the container
docker run -it --network=host -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=80G --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri/ --ipc=host --group-add video --privileged --name <your_project_name> <image_id> bash
# Install dependencies
python setup.py install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/ 
```

### Dockerfile (Option 2)

```
docker build --no-cache -t blip3_pytorch:latest .
docker run -it --network=host --name=blip3_pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v /opt/hyhal/:/opt/hyhal/:ro -v /usr/local/hyhal:/usr/local/hyhal:ro blip3_pytorch:latest bash
# Install dependencies
python setup.py install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
```

### Anaconda (Option 3)

```
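# 1. Create a conda virtual environment
conda create -n blip3_pytorch python=3.10
# 2. The toolkits and deep learning libraries this project needs for DCU cards can be downloaded from the Guanghe (光合) developer community: https://developer.hpccube.com/tool/
# DTK driver: dtk25.04.1
# python: python3.10
# torch: 2.4.1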
```
Tips: The versions of DTK, Python, torch, and the other DCU-related packages above must correspond to each other exactly.
```
# 3. Install the remaining, non-DCU-specific libraries from the requirements files
pip install -r requirements-training.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
python setup.py install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
```
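
The version-matching tip above can be checked before training starts. The following is a minimal sketch, not part of this project: it compares the running Python and the installed torch package against the versions listed for the Anaconda setup. The helper names (`matches`, `installed_versions`, `report`) are hypothetical, the handling of a `+...` local build suffix on the torch version is an assumption about how vendor wheels are labeled, and the DTK version is not checked because no query interface for it is described here.

```python
import sys
from importlib import metadata

# Versions listed in the Anaconda section above.
EXPECTED = {"python": "3.10", "torch": "2.4.1"}


def matches(found: str, expected: str) -> bool:
    """True if `found` equals `expected` or only adds a patch level or local suffix."""
    # Drop a local build suffix such as "+abc" (assumed form for vendor wheels).
    base = found.split("+", 1)[0]
    return base == expected or base.startswith(expected + ".")


def installed_versions() -> dict:
    """Collect the running Python version and the installed torch version, if any."""
    versions = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        versions["torch"] = metadata.version("torch")
    except metadata.PackageNotFoundError:
        versions["torch"] = "not installed"
    return versions


def report(found: dict, expected: dict = EXPECTED) -> list:
    """Return human-readable mismatch messages (empty list if everything matches)."""
    return [
        f"{name}: expected {want}, found {found.get(name, 'missing')}"
        for name, want in expected.items()
        if not matches(found.get(name, ""), want)
    ]


if __name__ == "__main__":
    problems = report(installed_versions())
    print("\n".join(problems) if problems else "Python and torch versions match.")
```

Running the file inside the conda environment prints either the mismatches or a single confirmation line.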

# Training

## Dataset
The model supports JSON dataset files in llava format. The JSON file structure is as follows:

```
JSON file:

{
 "id": "000000033471",
 "image": "coco/train2017/000000033471.jpg",
 "conversations": [
 {
 "from": "human",
 "value": "<image>\nWhat are the colors of the bus in the image?"
 },
 {
 "from": "gpt",
 "value": "The bus in the image is white and red."
 },
 ...
 ]
}
```
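
Before training, it can help to confirm that each record follows the structure shown above. The sketch below is illustrative only and is not part of this project: `validate_record` is a hypothetical helper, and it assumes that every record carries the `id`, `image`, and `conversations` keys from the example and that the first turn contains the `<image>` placeholder.

```python
def validate_record(record: dict) -> list:
    """Return a list of problems found in one llava-format record."""
    problems = []
    # Assumption: all three keys from the example above are required.
    for key in ("id", "image", "conversations"):
        if key not in record:
            problems.append(f"missing key: {key}")
    turns = record.get("conversations", [])
    if not turns:
        problems.append("conversations is empty")
    for i, turn in enumerate(turns):
        if turn.get("from") not in ("human", "gpt"):
            problems.append(f"turn {i}: unexpected speaker {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: value is not a string")
    # Assumption: the first turn references the image through the <image> placeholder.
    if turns and "<image>" not in str(turns[0].get("value", "")):
        problems.append("first turn has no <image> placeholder")
    return problems


example = {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
        {"from": "gpt", "value": "The bus in the image is white and red."},
    ],
}
print(validate_record(example))  # []
```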

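Next, configure [`data_configs/example_data_config.yaml`](./data_configs/example_data_config.yaml) with the paths of all JSON files and the number of images in each; a single YAML file can list several datasets. If your JSON files use relative paths for the data, also configure the path mapping file [`data/data_paths.py`](./data/data_paths.py).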
```
YAML file:

data_path: {
  '/path/to/blip_laion_cc_sbu_558k.json': 558128,
  '/path/to/som_qa_coco20k.json': 20160,
  '/path/to/som_listing_coco10k.json': 10000,
}
```
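
The counts in the YAML mapping can be derived from the JSON files rather than typed by hand. The following is a rough sketch under two assumptions that are not confirmed by this project: each JSON file is a top-level list of records, and each record contributes one image to the count. The function names are hypothetical.

```python
import json
from pathlib import Path


def count_records(json_path) -> int:
    """Count records in a llava-format file (assumed to be a top-level JSON list)."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{json_path}: expected a top-level list of records")
    return len(data)


def render_data_config(json_paths) -> str:
    """Render the data_path mapping in the same layout as the YAML example above."""
    lines = ["data_path: {"]
    for path in json_paths:
        lines.append(f"  '{Path(path).as_posix()}': {count_records(path)},")
    lines.append("}")
    return "\n".join(lines)
```

Calling `print(render_data_config([...]))` with the JSON paths of your datasets produces text that can be pasted into the YAML file after checking the counts.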

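This project trains on the SoM-LLaVA dataset, whose directory structure is shown below. It can be downloaded from Hugging Face ([download link](https://api-inference.hf-mirror.com/datasets/zzxslp/SoM-LLaVA/tree/main)) or with the `down_dataset_hf.py` script provided in this project.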

```
/path/to/SoM-LLaVA/ 
     ├── som_listing_coco10k.json
     ├── som_llava_mix695k.json
     ├── som_qa_coco20k.json
     ├── som_train2017
     │   ├── 000000000001.jpg
     │   ├── 000000000009.jpg
     │   └── ...
```
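
After downloading, a quick consistency check between the JSON annotations and the image folder can catch incomplete downloads. This is a minimal sketch, not a script shipped with this project: `missing_images` is a hypothetical helper, and it assumes each JSON file is a list of records whose `image` field is a path relative to a root directory that you supply. How this project's `data/data_paths.py` resolves relative paths is not shown here, so the root may need adjusting.

```python
import json
from pathlib import Path


def missing_images(json_path, image_root) -> list:
    """List image paths referenced in the JSON file that are absent under image_root."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    root = Path(image_root)
    missing = []
    for record in records:
        rel = record.get("image")
        # Records without an image field are skipped rather than reported.
        if rel and not (root / rel).is_file():
            missing.append(rel)
    return missing


# Example call (paths follow the placeholder layout above):
# missing_images("/path/to/SoM-LLaVA/som_qa_coco20k.json", "/path/to/SoM-LLaVA")
```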
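## Fine-tuning

#### Pretrained weights
Run the following script to generate a .pt file in native PyTorch format. Salesforce/xgen-mm-phi3-mini-base-r-v1.5, microsoft/Phi-3-mini-4k-instruct, and google/siglip-so400m-patch14-384 are downloaded automatically from Hugging Face; a Hugging Face mirror for mainland China is already configured in the script.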

```
# Set the dest_fn parameter to the output path and .pt file name
python convert_hf_model.py
```
#### Single node, multiple GPUs
```
bash scripts/example_finetune_xgenmmv1-phi3_mini_4k_instruct.sh
```
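The training script parameters are described below:
* `exp_name`: name of the training log file
* `data_path`: path to the YAML file
* `pretrained_ckpt`: path to the .pt file
* `--nproc_per_node=2`: number of cards used for multi-card training
* `--nnodes=1`: number of nodes
* `--master_port 9650`: port
* `--lm_path`: path to the language model (LM); defaults to "microsoft/Phi-3-mini-4k-instruct"
* `--tokenizer_path`: path to the tokenizer used to process text data; defaults to "microsoft/Phi-3-mini-4k-instruct"
* `--vision_encoder_path`: vision encoder; defaults to "google/siglip-so400m-patch14-384"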
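
To show how the launcher flags and model paths listed above fit together, here is a small sketch that assembles a launch command as a list of arguments. It is an illustration only: it assumes the shell script starts training through `torchrun`, the entry-point file name is a placeholder that you must supply, and `exp_name`, `data_path`, and `pretrained_ckpt` are left out because the way the script passes them on is not documented here.

```python
import shlex


def build_launch_command(
    entry_point,
    nproc_per_node=2,
    nnodes=1,
    master_port=9650,
    lm_path="microsoft/Phi-3-mini-4k-instruct",
    tokenizer_path="microsoft/Phi-3-mini-4k-instruct",
    vision_encoder_path="google/siglip-so400m-patch14-384",
):
    """Assemble a torchrun command line (assumed launcher) from the listed parameters."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",  # cards per node
        f"--nnodes={nnodes}",                  # number of nodes
        "--master_port", str(master_port),     # rendezvous port
        entry_point,                           # placeholder: the real script name is not given here
        "--lm_path", lm_path,
        "--tokenizer_path", tokenizer_path,
        "--vision_encoder_path", vision_encoder_path,
    ]


# Print the command for a hypothetical entry point and four cards.
print(shlex.join(build_launch_command("<train_entry_point>.py", nproc_per_node=4)))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` without shell quoting issues once the real entry point is filled in.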

## Result

### Application scenarios

### Algorithm category

Image-to-text

### Key application industries

AIGC, design

## Source repository and issue feedback

- https://developer.sourcefind.cn/codes/modelzoo/blip-3_pytorch
## References
- https://github.com/salesforce/LAVIS/tree/xgen-mm