# Granite-Speech_pytorch
## Paper
`Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities`
- https://arxiv.org/abs/2505.08699

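## Model architecture
Granite-speech uses a modular three-stage architecture: a Conformer acoustic encoder, a Q-former multimodal adapter, and a Granite text large language model (LLM) adapted with LoRA. This design decouples the audio and text processing paths while still fusing them.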

<div align=center>
    <img src="./doc/gs.png"/>
</div>

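## Algorithm principle
Granite-speech uses a Q-former adapter to downsample the high-dimensional audio sequence produced by the Conformer encoder and project it into the same semantic space as the text embeddings. The LLM is then lightly fine-tuned with LoRA so that it can understand and process these fused multimodal acoustic features without degrading its original text capabilities.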

<div align=center>
    <img src="./doc/qformer.png"/>
</div>
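
The downsampling idea can be illustrated with a toy, pure-Python sketch: a small set of query vectors attends over each fixed-size window of encoder frames, so every window is replaced by a few pooled vectors. This is not the Granite-speech implementation. The window size, number of queries, and feature dimension are illustrative assumptions, the queries are random rather than learned, and the learned projections and the final mapping into the LLM embedding space are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames):
    """Single-head dot-product attention of one query over a window of frames."""
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    return [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]


def downsample(frames, queries, window):
    """Replace every `window` frames with len(queries) attention-pooled vectors."""
    pooled = []
    for start in range(0, len(frames), window):
        block = frames[start:start + window]
        pooled.extend(attend(q, block) for q in queries)
    return pooled


random.seed(0)
DIM, WINDOW, NUM_QUERIES = 8, 15, 3  # illustrative values, not taken from this README
frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(60)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]  # stand-ins for learned queries
pooled = downsample(frames, queries, WINDOW)
print(len(frames), "->", len(pooled))  # 60 -> 12
```

With these assumed values, each window of 15 frames becomes 3 vectors, so the sequence handed to the LLM is 5 times shorter than the encoder output.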

## Environment setup
### Hardware requirements
DCU model: K100_AI; nodes: 1; cards: 1.
### Docker (Option 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04-rc7-das1.5-py3.10-20250612-fixpy-rocblas0611-rc2

docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

# transformers must be upgraded to at least 4.53.1
pip install transformers==4.53.1
pip install librosa==0.11.0
# The torchaudio package must be installed
wget https://download.sourcefind.cn:65024/directlink/4/torchaudio/DAS1.6/torchaudio-2.4.1+das.opt1.dtk25041-cp310-cp310-manylinux_2_28_x86_64.whl
pip install torchaudio-2.4.1+das.opt1.dtk25041-cp310-cp310-manylinux_2_28_x86_64.whl

cd /your_code_path/granite-speech_pytorch
```
### Dockerfile (Option 2)
Build the image from the provided Dockerfile as follows:
```bash
cd docker
docker build --no-cache -t granite-speech:latest .
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

# transformers must be upgraded to at least 4.53.1
pip install transformers==4.53.1
pip install librosa==0.11.0
# The torchaudio package must be installed
wget https://download.sourcefind.cn:65024/directlink/4/torchaudio/DAS1.6/torchaudio-2.4.1+das.opt1.dtk25041-cp310-cp310-manylinux_2_28_x86_64.whl
pip install torchaudio-2.4.1+das.opt1.dtk25041-cp310-cp310-manylinux_2_28_x86_64.whl

cd /your_code_path/granite-speech_pytorch
```
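### Anaconda (Option 3)
The DCU-specific deep learning libraries this project requires can be downloaded and installed from the [光合 (Guanghe)](https://developer.sourcefind.cn/tool/) developer community.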
```bash
DTK: 25.04
python: 3.10
vllm: 0.8.5
torch: 2.4.1+das.opt1.dtk25041
torchaudio: 2.4.1+das.opt1.dtk25041
```
`Tip: the versions of the DTK driver, torch, and the other DCU-related tools listed above must match one another exactly.`
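
One way to confirm the Python packages match the versions listed above is a small check with the standard library's `importlib.metadata`. This is a sketch, not part of the repository. It assumes the pip distribution names equal the names in the list, and it compares by prefix because DCU builds may carry extra local version suffixes. The helper name `find_mismatches` is hypothetical.

```python
from importlib import metadata

# Expected versions copied from the list above; the package names are assumed
# to match the pip distribution names.
EXPECTED = {
    "vllm": "0.8.5",
    "torch": "2.4.1+das.opt1.dtk25041",
    "torchaudio": "2.4.1+das.opt1.dtk25041",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, installed)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Prefix comparison tolerates vendor-specific local version suffixes.
        if installed is None or not installed.startswith(wanted):
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

The script prints nothing when every package matches. The DTK driver version is not a Python package and has to be checked separately.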

Install the other, non-deep-learning libraries as follows:
```bash
pip install transformers==4.53.1
pip install librosa==0.11.0
```
## Dataset
Not available yet.
## Training
Not available yet.
## Inference
### vLLM inference
```bash
## Set the following environment variables
export HF_ENDPOINT=https://hf-mirror.com
export LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torchaudio.libs:$LD_LIBRARY_PATH
## Pass the model path as an argument
python ./infer/infer_vllm.py --model-type granite_speech --model_name /your_path/granite-speech-3.3-8b
```

## Result
```
--- Prompt 1 ---
Generated Text: the first words i spoke in the original phonograph a little piece of practical poetry mary had a little lamb its fleece was white as snow and everywhere that mary went the lamb was sure to go

Logprobs per generated token:
  Step 0:
    - Generated Token: 1382 ('the')
    - Top Logprobs:
        - Rank 1: Token 1382 ('the') -> Logprob: -0.1331
        - Rank 2: Token 37711 ('these') -> Logprob: -3.5237
        - Rank 3: Token 31181 ('they') -> Logprob: -5.1253
        - Rank 4: Token 1772 ('my') -> Logprob: -5.1800
        - Rank 5: Token 292 ('he') -> Logprob: -5.4612
        - Rank 6: Token 2232 ('first') -> Logprob: -5.7268
        - Rank 7: Token 91 ('i') -> Logprob: -5.7503
        - Rank 8: Token 266 ('in') -> Logprob: -5.9378
        - Rank 9: Token 83 ('a') -> Logprob: -5.9378
        - Rank 10: Token 7020 ('here') -> Logprob: -6.0159
  Step 1:
    ...
    ...

成功将每个生成token的logprob写入到文件: ...
```
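
For further analysis, the printed report can be turned back into structured data. The sketch below parses the `Step` and `Rank` lines in the format shown above using only the standard library. The function name `parse_top_logprobs` is hypothetical, and the format of the file that the script writes is not shown in this README, so only the console layout is handled.

```python
import re

STEP_RE = re.compile(r"^\s*Step (\d+):")
RANK_RE = re.compile(r"- Rank (\d+): Token (\d+) \('(.*)'\) -> Logprob: (\S+)")


def parse_top_logprobs(text):
    """Return {step: [(rank, token_id, token_text, logprob), ...]}."""
    steps = {}
    current = None
    for line in text.splitlines():
        step = STEP_RE.match(line)
        if step:
            current = int(step.group(1))
            steps[current] = []
            continue
        rank = RANK_RE.search(line)
        if rank and current is not None:
            steps[current].append(
                (int(rank.group(1)), int(rank.group(2)), rank.group(3), float(rank.group(4)))
            )
    return steps


# Excerpt of the console output shown above.
SAMPLE = """\
  Step 0:
    - Generated Token: 1382 ('the')
    - Top Logprobs:
        - Rank 1: Token 1382 ('the') -> Logprob: -0.1331
        - Rank 2: Token 37711 ('these') -> Logprob: -3.5237
"""

parsed = parse_top_logprobs(SAMPLE)
print(parsed[0][0])  # (1, 1382, 'the', -0.1331)
```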

### Accuracy
```
# Run infer_vllm.py on the DCU and on the GPU to produce each platform's accuracy data
python ./infer/calc_mae.py
```
Result:
```
0.00040159359081176795
```

Accuracy on the DCU is consistent with the GPU. Inference framework: vLLM.
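
The contents of `calc_mae.py` are not shown here. As a minimal sketch of the metric its name suggests, the following computes the mean absolute error between two equal-length lists of per-token logprobs, one from each platform. The input values are made up for illustration and are not measured results.

```python
def mean_absolute_error(reference, candidate):
    """Mean of |reference[i] - candidate[i]| over two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


# Hypothetical per-token logprobs from the two runs (illustrative only).
gpu_logprobs = [-0.1331, -3.5237, -5.1253]
dcu_logprobs = [-0.1335, -3.5230, -5.1260]

print(mean_absolute_error(gpu_logprobs, dcu_logprobs))
```

A value close to zero means the two platforms assign nearly identical probabilities to the generated tokens.
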
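## Application scenarios
### Algorithm category
`Speech dialogue`
### Key application industries
`Finance, education, government, scientific research, manufacturing, energy, transportation`
## Pre-trained weights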
- [ibm-granite/granite-speech-3.3-8b](https://huggingface.co/ibm-granite/granite-speech-3.3-8b)
- [ibm-granite/granite-speech-3.3-2b](https://huggingface.co/ibm-granite/granite-speech-3.3-2b)

## Source repository and issue feedback
- https://developer.sourcefind.cn/codes/modelzoo/granite-speech_pytorch
## References
- https://github.com/ibm-granite/granite-speech-models