# Mini-Omni2
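Mini-Omni2 is one of the multimodal models closest to GPT-4o in functionality. It is a vision-audio assistant for real-time voice conversation: it processes visual, audio, and text modalities together and returns end-to-end speech responses in real time to users' video and voice queries.
## Paper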
`Mini-Omni2: Towards Open-source GPT-4o Model with Vision, Speech and Duplex`
- https://arxiv.org/pdf/2410.11190
 
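## Model Architecture
Omni2 uses Qwen2-0.5B as its language model, an MLP of the kind commonly used in LLaMA-style models as the adapter, and well-performing pretrained encoders for each modality, so that a multimodal model can be trained with a very simple recipe.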
<div align=center>
    <img src="./doc/omni2.png"/>
</div>

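## Algorithm
Omni2 builds on the capabilities of a text-to-text large language model. It reuses the encoding ability of pretrained encoders for the different modalities and adds an MLP layer to adapt to multimodal tasks. With a suitable fine-tuning strategy, it can reach reasonable multimodal performance even when paired multimodal training data is scarce. The figure below shows its three-stage fine-tuning strategy.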
<div align=center>
    <img src="./doc/train.png"/>
</div>

## Environment Setup
```
mv mini-omni2_pytorch mini-omni2 # strip the framework suffix from the directory name
```

### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.2-py3.10
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is 83714c19d308
docker run -it --shm-size=64G -v $PWD/mini-omni2:/home/mini-omni2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name omni2 <your IMAGE ID> bash
cd /home/mini-omni2
pip install -r requirements.txt # install the Python dependencies
# Install ffmpeg
apt update
apt-get install ffmpeg
# Install CLIP
cd CLIP
pip install . #clip==1.0
```
### Dockerfile (Option 2)
```
cd mini-omni2/docker
docker build --no-cache -t omni2:latest .
docker run --shm-size=64G --name omni2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../mini-omni2:/home/mini-omni2 -it omni2 bash
# If building the environment from the Dockerfile takes too long, comment out the pip install step inside it and install the Python libraries after the container starts: pip install -r requirements.txt.
cd /home/mini-omni2
# Install ffmpeg
apt update
apt-get install ffmpeg
# Install CLIP (clip-1.0)
cd CLIP
pip install . #clip==1.0
```
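### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the Guanghe (光合) developer community: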
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk24.04.2
python: python3.10
torch: 2.3.0
torchvision: 0.18.1
torchaudio: 2.1.2
triton: 2.1.0
flash-attn: 2.0.4
deepspeed: 0.14.2
apex: 1.3.0
xformers: 0.0.25
```

`Tips: the versions of the DTK driver, Python, torch, and the other DCU-related tools listed above must match one another exactly.`

2. Install the remaining, non-DCU-specific libraries from requirements.txt:
```
cd mini-omni2
pip install -r requirements.txt # install the Python dependencies
# Install ffmpeg
apt update
apt-get install ffmpeg
# Install CLIP
cd CLIP
pip install . #clip==1.0
```
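
Because the DCU-related versions in step 1 have to line up exactly, it can help to compare what is actually installed against that list. The following is a minimal sketch, not part of this repository: `version_matches` and `check_environment` are hypothetical helper names, and it assumes the packages register under the distribution names shown (for example `flash-attn`). DCU builds may append a local suffix after `+` to the version string, so only the leading release part is compared. The DTK driver and Python version are not covered here.

```python
from importlib import metadata

# Versions pinned in step 1 above (DTK driver and Python are not checked here).
PINNED = {
    "torch": "2.3.0",
    "torchvision": "0.18.1",
    "torchaudio": "2.1.2",
    "triton": "2.1.0",
    "flash-attn": "2.0.4",
    "deepspeed": "0.14.2",
    "apex": "1.3.0",
    "xformers": "0.0.25",
}


def version_matches(installed, pinned):
    """Return True if `installed` equals `pinned`, ignoring a local '+suffix'."""
    release = installed.split("+", 1)[0]
    return release == pinned


def check_environment(pins=PINNED, get_version=metadata.version):
    """Return {package: problem description} for every pin that is not satisfied."""
    problems = {}
    for name, pinned in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not version_matches(installed, pinned):
            problems[name] = f"found {installed}, expected {pinned}"
    return problems


if __name__ == "__main__":
    # Print one line per mismatch; no output means every pin is satisfied.
    for name, problem in check_environment().items():
        print(f"{name}: {problem}")
```

Running the script inside the container or conda environment prints one line per package that is missing or at a different version.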

## Dataset


## Training


## Inference
1. Download the pretrained weights `gpt-omni/mini-omni2` and place all files from mini-omni2 into a newly created checkpoint folder:
```
# Create the checkpoint folder and move the downloaded weights into it
mkdir -p checkpoint
mv gpt-omni/mini-omni2/* checkpoint/
```

2. Download the pretrained weights `hubertsiuzdak/snac_24khz`, place the hubertsiuzdak folder in the mini-omni2 root directory, then run inference:
```
cd mini-omni2
python inference_vision.py
```
For more information, see the original project's [`README_origin`](./README_origin.md).
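
Before launching `inference_vision.py`, it can be useful to confirm that the weights and sample inputs are where the steps above put them. The sketch below is an illustrative pre-flight check rather than something the repository provides: `missing_inputs` is a hypothetical helper, and the sample file paths are the ones listed in the Result section of this README, assumed to ship with the repository.

```python
from pathlib import Path

# Directories created by inference steps 1 and 2.
REQUIRED_DIRS = ["checkpoint", "hubertsiuzdak/snac_24khz"]
# Sample inputs named in this README (assumed to be present in the repository).
REQUIRED_FILES = [
    "data/samples/vision_qa_audio.wav",
    "data/samples/vision_qa_image.jpg",
]


def missing_inputs(root="."):
    """Return the required paths that are absent, or empty directories, under `root`."""
    root = Path(root)
    missing = []
    for rel in REQUIRED_DIRS:
        path = root / rel
        # A directory counts only if it exists and holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Run from the mini-omni2 root; no output means everything was found.
    for rel in missing_inputs():
        print(f"missing: {rel}")
```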

## Result
`Input:`
```
# Audio
input_audio_path = './data/samples/vision_qa_audio.wav'
# Image
input_image_path = './data/samples/vision_qa_image.jpg'
```
`Output:`
```
# Text
text output: The person in the image appears to be a middle-aged man with a fair complexion. He has short, neatly combed gray hair and a receding hairline. His facial features include a prominent nose, thin lips, and a gentle smile. He is wearing a dark suit with a notched lapel, a white shirt, and a dark tie with diagonal stripes. The background is a neutral, dark color that provides a contrast to his light-colored suit. The overall impression is one of professionalism and formality.
# Audio
'vision_qa_output.wav'
```
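
To sanity-check the generated audio without playing it, the file's header can be read with Python's standard `wave` module. This is a small sketch under assumptions: `wav_summary` is a hypothetical helper, the output is assumed to be a PCM WAV file (the `wave` module may reject other encodings), and the output path is taken from the example above. The codec name `snac_24khz` suggests a 24 kHz sample rate, but treat that as something to verify rather than a documented guarantee.

```python
import wave
from pathlib import Path


def wav_summary(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


if __name__ == "__main__":
    output = Path("vision_qa_output.wav")
    # Only inspect the file if inference has already produced it.
    if output.is_file():
        channels, rate, duration = wav_summary(output)
        print(f"{channels} channel(s), {rate} Hz, {duration:.2f} s")
```

A duration of zero or an unexpected sample rate is a quick hint that the speech decoding step did not run as intended.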

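### Accuracy
Accuracy on DCU is consistent with accuracy on GPU. Inference framework: PyTorch.

## Application Scenarios
### Algorithm Category
`Conversational question answering`
### Key Application Industries
`Manufacturing, broadcast media, finance, energy, healthcare, home, education`
## Pretrained Weights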
Hugging Face download links: [gpt-omni/mini-omni2](https://huggingface.co/gpt-omni/mini-omni2), [hubertsiuzdak/snac_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz)
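
One way to fetch both repositories directly into the layout that the Inference section expects is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed and that the Hugging Face Hub is reachable; `DOWNLOAD_PLAN` and `download_weights` are hypothetical names introduced for illustration. Downloading straight into `checkpoint` would replace the manual `mv` step.

```python
# Target directory for each Hugging Face repo, following the Inference steps.
DOWNLOAD_PLAN = {
    "gpt-omni/mini-omni2": "checkpoint",
    "hubertsiuzdak/snac_24khz": "hubertsiuzdak/snac_24khz",
}


def download_weights(plan=DOWNLOAD_PLAN, downloader=None):
    """Download every repo in `plan` into its target directory and return the paths."""
    if downloader is None:
        # Imported lazily so the plan can be inspected without the package installed.
        from huggingface_hub import snapshot_download as downloader
    return [
        downloader(repo_id=repo, local_dir=target)
        for repo, target in plan.items()
    ]
```

Calling `download_weights()` from the mini-omni2 root directory should leave the weights under `checkpoint/` and `hubertsiuzdak/snac_24khz/`.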
## Source Repository and Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/mini-omni2_pytorch.git
## References
- https://github.com/gpt-omni/mini-omni2.git
- https://github.com/QwenLM/Qwen2.5.git