# Emu3.5_pytorch
## Paper
[Emu3.5](https://arxiv.org/pdf/2510.26583)
## Model Architecture
Building on the Next-Token Prediction paradigm, Emu3.5 emulates the way humans learn naturally: its autoregressive architecture performs Next-State Prediction (NSP) over multimodal sequences, giving it generalizable world-modeling capability.
## Algorithm Overview
Emu3.5 is a multimodal world model released by the Beijing Academy of Artificial Intelligence (BAAI). It is pretrained end to end on more than 10 trillion multimodal tokens, drawn mainly from internet videos totaling roughly 790 years of footage. This gives it native world-modeling capability: it can understand and generate data across the text, image, and video modalities.
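As a rough illustration of the paradigm only (not the actual Emu3.5 code), next-state prediction follows the same loop structure as ordinary autoregressive decoding over a single interleaved token stream; the toy model below is a stub standing in for the real transformer:

```python
# Toy sketch of autoregressive next-state prediction over an interleaved
# multimodal token stream. Real Emu3.5 tokens may encode text or image
# content; here they are plain integers and the "model" is a stub.
def toy_next_token(context):
    # Stub policy standing in for transformer logits + sampling.
    return (context[-1] + 1) % 10

def generate(prompt_tokens, num_steps):
    tokens = list(prompt_tokens)
    for _ in range(num_steps):
        # Each step conditions on the full history and appends one token:
        # the loop structure shared by next-token / next-state prediction.
        tokens.append(toy_next_token(tokens))
    return tokens

print(generate([1, 2, 3], 4))  # -> [1, 2, 3, 4, 5, 6, 7]
```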
## Environment Setup
### Hardware Requirements
DCU model: BW1000; nodes: 1; cards: 2.
Adjust the `-v` mount paths, `{docker_name}`, and `{imageID}` below to match your actual environment.
### Docker (Method 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.7.1-ubuntu22.04-dtk25.04.2-py3.10-alpha
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash
pip install http://10.16.4.1:8000/debug/torchaudio/dtk25.04.2-beta-bug-fix/torch251-audio/torch251-audio-fastpt/torchaudio-2.5.1a0%2Bd178b24-cp310-cp310-manylinux_2_28_x86_64.whl
pip install http://10.16.4.1:8000/debug/flash_attn/dtk25.04.2-rc1/dtk25.04-llvm0106/flash_attn-2.6.1%2Bdas.opt1.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install transformers -U
cd /your_code_path/emu3.5_pytorch
pip install -r requirements.txt
```
### Dockerfile (Method 2)
```bash
cd docker
docker build --no-cache -t emu3.5_pytorch:latest .
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash
pip install http://10.16.4.1:8000/debug/torchaudio/dtk25.04.2-beta-bug-fix/torch251-audio/torch251-audio-fastpt/torchaudio-2.5.1a0%2Bd178b24-cp310-cp310-manylinux_2_28_x86_64.whl
pip install http://10.16.4.1:8000/debug/flash_attn/dtk25.04.2-rc1/dtk25.04-llvm0106/flash_attn-2.6.1%2Bdas.opt1.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install transformers -U
cd /your_code_path/emu3.5_pytorch
pip install -r requirements.txt
```
### Anaconda (Method 3)
The DCU-specific deep learning libraries required by this project can be downloaded from the [光合 (SourceFind)](https://developer.sourcefind.cn/tool/) developer community.
```text
DTK: 25.04.2
python: 3.10
torch: 2.7.1a0+das.opt1.dtk25042
accelerate: 1.11.0
transformers: 4.48.2
flash_attn: 2.6.1+das.opt1.dtk2504
```
`Tips: The DTK driver, Python, PyTorch, and other DCU-related tool versions above must correspond to one another exactly.`
```bash
pip install http://10.16.4.1:8000/debug/torchaudio/dtk25.04.2-beta-bug-fix/torch251-audio/torch251-audio-fastpt/torchaudio-2.5.1a0%2Bd178b24-cp310-cp310-manylinux_2_28_x86_64.whl
pip install http://10.16.4.1:8000/debug/flash_attn/dtk25.04.2-rc1/dtk25.04-llvm0106/flash_attn-2.6.1%2Bdas.opt1.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install transformers -U
cd /your_code_path/emu3.5_pytorch
pip install -r requirements.txt
```
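Because the versions above must match exactly, a small helper can compare an installed version string against a pin. This is only a sketch, not part of the project; it treats the part after `+` (e.g. `+das.opt1.dtk25042`) as a local build suffix and accepts either an exact match or, when the pin carries no suffix, a match on the public version:

```python
# Compare an installed package version against a pinned one.
# DCU builds append a local suffix after '+' (e.g. '+das.opt1.dtk25042').
# Accept an exact match, or, when the pin has no local suffix,
# a match on the public part of the installed version.
def matches_pin(installed: str, pinned: str) -> bool:
    return installed == pinned or installed.split("+")[0] == pinned

# Pins taken from the version table above.
pins = {
    "torch": "2.7.1a0+das.opt1.dtk25042",
    "accelerate": "1.11.0",
    "transformers": "4.48.2",
    "flash_attn": "2.6.1+das.opt1.dtk2504",
}

print(matches_pin("2.7.1a0+das.opt1.dtk25042", pins["torch"]))  # True
print(matches_pin("4.48.2", pins["transformers"]))              # True
print(matches_pin("2.5.1", pins["torch"]))                      # False
```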
## Dataset
None
## Training
Not available yet
## Inference
Sample model: [Emu3.5-Image](https://huggingface.co/BAAI/Emu3.5-Image)
The inference commands for the different tasks are:
```bash
# 🖼️ Text-to-Image (T2I) task
python inference.py --cfg configs/example_config_t2i.py
# 🔄 Any-to-Image (X2I) task
python inference.py --cfg configs/example_config_x2i.py
# 🎯 Visual Guidance task
python inference.py --cfg configs/example_config_visual_guidance.py
# 📖 Visual Narrative task
python inference.py --cfg configs/example_config_visual_narrative.py
# After running inference, the model will generate results in protobuf format (.pb files) for each input prompt.
```
Visualize the protobuf output (`<input_path>` and `<output_path>` are placeholders; replace them with actual paths):
```bash
python src/utils/vis_proto.py --input <input_path> --output <output_path> [--video]
```
## Result
```text
Handling prompt: <|extra_203|>You are a helpful assistant for t2i task. USER: A lively comic-style illustration depicting two humorous cartoon dogs interacting near a freshly dug backyard hole surrounded by scattered dirt, garden tools, blooming flowers, and a wooden fence background. At the upper-left side, Dog One stands nervously near the messy hole, ears down and eyes wide open with an expression of concern. Its speech bubble is an oval shape, outlined neatly with smooth, slightly rounded corners, positioned clearly above Dog One's head. Inside, clearly readable playful handwritten-style text emphasizes the dog's worried tone, saying, "You sure the humans won't notice this giant hole here?". Toward the lower-right side, Dog Two sits calmly and confidently with a cheerful, carefree expression, wagging its tail gently. Its speech bubble is rectangular with softly rounded edges, placed slightly overlapping with Dog One's speech bubble to guide the reader naturally downward diagonally across the frame. Dog Two's friendly, humorous response appears in a whimsical italicized comic font, clearly stating, "Relax! We'll just blame it on the neighbor's cat again!". Each speech bubble creats the playful and engaging backyard scene. ASSISTANT: <|extra_100|>
```
### Precision
DCU precision is consistent with GPU; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
Multimodal
### Key Application Industries
Manufacturing, broadcast media, home furnishing, education
## Pretrained Weights
- [Emu3.5](https://huggingface.co/BAAI/Emu3.5/tree/main)
- [Emu3.5-Image](https://huggingface.co/BAAI/Emu3.5-Image/tree/main)
- [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main)
## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/emu3.5_pytorch
## References
- https://huggingface.co/BAAI/Emu3.5-Image