# Bagel

## Paper

`Emerging Properties in Unified Multimodal Pretraining`

* https://arxiv.org/abs/2505.14683

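## Model Architecture

BAGEL adopts an MoT (Mixture of Transformers) architecture made up of two Transformer experts: one dedicated to multimodal understanding and the other to multimodal generation. Correspondingly, the model uses two separate visual encoders, one for understanding and one for generation.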

![alt text](readme_imgs/arch.png)
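The two-expert layout can be illustrated with a toy attention layer. This is a minimal sketch, not BAGEL's implementation: it assumes each token is hard-routed by modality to one expert's Q/K/V weights while attention runs over the whole sequence. The expert names `und` and `gen`, the function names, and the 2x2 weights are all hypothetical, and the sketch uses a single head with no causal mask.

```python
import math

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mot_attention(tokens, modalities, experts):
    """Toy MoT attention layer (single head, no causal mask)."""
    # Hard routing: each token uses the Q/K/V weights of its own expert.
    q = [linear(experts[m]["q"], x) for x, m in zip(tokens, modalities)]
    k = [linear(experts[m]["k"], x) for x, m in zip(tokens, modalities)]
    v = [linear(experts[m]["v"], x) for x, m in zip(tokens, modalities)]
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for qi in q:
        # Shared self-attention: every token attends over the full sequence,
        # regardless of which expert produced the keys and values.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        outputs.append([sum(w * vj[d] for w, vj in zip(weights, v))
                        for d in range(len(v[0]))])
    return outputs

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
DOUBLE = [[2.0, 0.0], [0.0, 2.0]]
# Hypothetical experts; they differ only in their V weights here.
EXPERTS = {
    "und": {"q": IDENTITY, "k": IDENTITY, "v": IDENTITY},
    "gen": {"q": IDENTITY, "k": IDENTITY, "v": DOUBLE},
}
mixed = mot_attention([[1.0, 0.0], [1.0, 0.0]], ["und", "gen"], EXPERTS)
```

In this example both tokens receive the same output, an equal blend of the value vectors produced by the two experts, which is the sense in which the experts share one attention operation.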

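## Algorithm

At every layer, the two Transformer experts process the same token sequence through a shared self-attention mechanism. For text tokens, BAGEL follows the Next-Token-Prediction paradigm, carrying over the proven strengths of autoregressive language models. For visual tokens, BAGEL uses the Rectified Flow method.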

![alt text](readme_imgs/alg.png)
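As a rough illustration of the Rectified Flow idea used for visual tokens, the sketch below shows the straight-line interpolation between a noise sample and a data sample, the constant velocity a model would be trained to predict, and Euler integration at sampling time. It is a minimal sketch under stated assumptions: the time convention (noise at `t = 0`, data at `t = 1`) and all function names are assumptions, and an oracle velocity stands in for the trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Regression target for the velocity model: constant along the path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 0.0]
data = [1.0, -2.0]
target = velocity_target(noise, data)

# Oracle velocity in place of a trained model (assumption for illustration).
sample = euler_sample(noise, lambda x, t: target, steps=10)
```

With the oracle velocity, the integration lands on the data sample up to floating-point error. A trained model only approximates this velocity, so in practice the number of steps trades off speed against quality.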

## Environment Setup

Note: the installation packages provided with this project are only compatible with this project.

### Docker (Option 1)
    
    docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10

    docker run --shm-size 100g --network=host --name=bagel --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash

    pip install -r requirements.txt

    pip install whl/torch-2.5.1+das.opt1.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl

    pip install whl/torchvision-0.20.1a0-04d8fc4-cp310-cp310-manylinux_2_28_x86_64.whl


### Dockerfile (Option 2)

    docker build -t <IMAGE_NAME>:<TAG> .

    docker run --shm-size 100g --network=host --name=bagel --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
    
    pip install -r requirements.txt

    pip install whl/torch-2.5.1+das.opt1.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl

    pip install whl/torchvision-0.20.1a0-04d8fc4-cp310-cp310-manylinux_2_28_x86_64.whl

### Anaconda (Option 3)

1. The special deep learning libraries that this project needs for DCU cards can be downloaded from the 光合 developer community: https://developer.sourcefind.cn/tool/

```
DTK driver:dtk25.04
python:python3.10
triton:3.0
flash-attn:2.6.1
deepspeed:0.14.2
apex:1.4.0
```

2. Install the remaining, non-special libraries directly from requirements.txt:

```
pip install -r requirements.txt

pip install whl/torch-2.5.1+das.opt1.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl

pip install whl/torchvision-0.20.1a0-04d8fc4-cp310-cp310-manylinux_2_28_x86_64.whl
```
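After installation, it can help to confirm that the DCU-specific libraries match the versions listed above. The following is a minimal sketch using only the standard library. The distribution names (`triton`, `flash-attn`, `deepspeed`, `apex`) are assumed to match the names the packages are installed under, and the helper names are hypothetical.

```python
import re
from importlib import metadata

# Versions taken from the list above; distribution names are assumptions.
EXPECTED = {
    "triton": "3.0",
    "flash-attn": "2.6.1",
    "deepspeed": "0.14.2",
    "apex": "1.4.0",
}

def release_tuple(version):
    """Leading numeric release parts, ignoring local suffixes like '+das...'."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def version_matches(installed, expected):
    """True if `installed` agrees with `expected` on every part `expected` names."""
    want = release_tuple(expected)
    return release_tuple(installed)[:len(want)] == want

def check_environment(expected=EXPECTED):
    """Map each package to (installed version or None, whether it matches)."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (have, version_matches(have, want))
    return report

if __name__ == "__main__":
    for name, (have, ok) in check_environment().items():
        print(f"{name}: installed={have} expected={EXPECTED[name]} ok={ok}")
```

The comparison deliberately ignores local build suffixes, since the DCU wheels carry tags such as `+das.opt1.dtk2504` after the release number.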

## Dataset

None

## Training

None

## Inference

```bash
python inference_test.py
```

Note: update the relevant parameter, `model_path`, before running.
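A quick pre-flight check of the directory that `model_path` points to can catch a wrong path before the model starts loading. This is a minimal sketch: the helper `missing_files` is hypothetical, and the list of required file names is left to the caller because the exact contents of the weights directory are not specified here.

```python
from pathlib import Path

def missing_files(model_path, required):
    """Return the names in `required` that are not files under `model_path`.

    Raises FileNotFoundError if `model_path` itself is not a directory.
    """
    root = Path(model_path)
    if not root.is_dir():
        raise FileNotFoundError(f"model_path is not a directory: {root}")
    return [name for name in required if not (root / name).is_file()]
```

For example, `missing_files("/path/to/BAGEL-7B-MoT", ["config.json"])` would return an empty list when that (assumed) file is present, and `["config.json"]` otherwise.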

## Results

|Task|Input 1|Input 2|Output|
|:---:|:---|:---:|:---|
|t2i|A female cosplayer portraying an ethereal fairy or elf, wearing a flowing dress made of delicate fabrics in soft, mystical colors like emerald green and silver. She has pointed ears, a gentle, enchanting expression, and her outfit is adorned with sparkling jewels and intricate patterns. The background is a magical forest with glowing plants, mystical creatures, and a serene atmosphere||![](readme_imgs/r1.png)|
|editing|She boards a modern subway, quietly reading a folded newspaper, wearing the same clothes.|![](test_images/women.jpg)|![](readme_imgs/r2.png)|
|understand|Can someone explain what’s funny about this meme??|![](test_images/meme.jpg)|The humor in this meme comes from the exaggerated change in handwriting over the course of an exam. At the beginning, the handwriting is clear and legible, indicating a calm and focused start to the exam. However, as the exam progresses, the handwriting becomes increasingly difficult to read, suggesting that the student becomes more anxious and less able to write clearly. The final part of the meme shows an electrocardiogram (ECG) reading, which humorously implies that the student's heart rate is racing and their writing is becoming erratic, much like the fluctuations seen in an ECG. This visual pun highlights the common experience of feeling more nervous and less in control as an exam progresses.|


### Accuracy

Accuracy on DCU is consistent with that on GPU. Inference framework: PyTorch.

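## Application Scenarios

### Algorithm Category

`Multimodal`

### Popular Application Industries

`E-commerce, Education, Broadcast media`

## Pretrained Weights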

[BAGEL-7B-MoT](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT)

## Source Repository and Issue Feedback

* https://developer.sourcefind.cn/codes/modelzoo/bagel_pytorch

## References

* https://github.com/ByteDance-Seed/Bagel/tree/main