# Granite-Speech_pytorch
## Paper
`Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities`
- https://arxiv.org/abs/2505.08699

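## Model architecture
Granite-speech uses a modular three-part architecture: a Conformer acoustic encoder, a Q-former multimodal adapter, and a Granite text large language model (LLM) adapted with LoRA. This keeps the audio and text processing paths decoupled while still fusing them.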

<div align=center>
    <img src="./doc/gs.png"/>
</div>

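## Algorithm
Granite-speech uses the Q-former adapter to downsample the high-dimensional audio sequence produced by the Conformer encoder and project it into the same semantic space as the text embeddings. The LLM is then fine-tuned lightly with LoRA so that it can understand and process these fused multimodal acoustic features without degrading its original text capabilities.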
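
The adapter step described above can be illustrated with a minimal pure-Python sketch: a small set of query vectors cross-attends over each fixed-size window of encoder frames, and the pooled outputs are projected to the LLM embedding width. This is not the repository's implementation. The function names, the window size of 15, the 3 queries per window, and the toy dimensions are all assumptions chosen for illustration, and a real Q-former would also have learned key/value projections, multiple heads, and feed-forward layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head cross-attention: each query pools the frames of one window."""
    dim = len(queries[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in frames]
        weights = softmax(scores)
        pooled.append([sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)])
    return pooled


def project(matrix, vector):
    """Linear projection without bias; matrix has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def qformer_like_adapter(frames, queries, proj, window):
    """Downsample the frames window by window, then project to the LLM width."""
    tokens = []
    for start in range(0, len(frames), window):
        block = frames[start:start + window]
        tokens.extend(project(proj, p) for p in cross_attend(queries, block))
    return tokens


random.seed(0)
# Hypothetical toy sizes, far smaller than any real model.
ENC_DIM, LLM_DIM, WINDOW, NUM_QUERIES = 4, 6, 15, 3
frames = [[random.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(30)]
queries = [[random.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(NUM_QUERIES)]
proj = [[random.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(LLM_DIM)]

audio_tokens = qformer_like_adapter(frames, queries, proj, WINDOW)
print(len(frames), "frames ->", len(audio_tokens), "tokens of width", len(audio_tokens[0]))
```

With these toy settings the sequence shrinks by a factor of `WINDOW / NUM_QUERIES`, and every output vector has the width the text embeddings would use, so it could be concatenated with text token embeddings.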

<div align=center>
    <img src="./doc/qformer.png"/>
</div>
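
The LoRA part of the description can be sketched in the same spirit. The snippet below is a toy illustration, not the model's code: it adds a low-rank update `(alpha / r) * B @ A` to a frozen weight matrix and checks that, with `B` initialised to zero as is conventional for LoRA, the adapted layer starts out reproducing the frozen layer's output. All names, values, and sizes are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a [rows][cols] matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, alpha, x):
    """Compute y = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    rank = len(lora_a)
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical toy layer: 3 inputs, 2 outputs, LoRA rank 1.
weight = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen base weights (2 x 3)
lora_a = [[0.5, -0.5, 1.0]]                   # A: rank x in_dim (1 x 3)
lora_b_init = [[0.0], [0.0]]                  # B at initialisation: all zeros (2 x 1)
lora_b_tuned = [[0.25], [-0.5]]               # B after some made-up training
x = [1.0, 2.0, 3.0]

frozen_out = matvec(weight, x)
init_out = lora_forward(weight, lora_a, lora_b_init, alpha=2.0, x=x)
tuned_out = lora_forward(weight, lora_a, lora_b_tuned, alpha=2.0, x=x)
print("frozen:", frozen_out, "init:", init_out, "tuned:", tuned_out)
```

Because only `A` and `B` change during fine-tuning, removing or zeroing the update recovers the original text model exactly, which is one way to read the claim that the text capabilities are preserved.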

## Environment setup
### Hardware requirements
DCU model: K100_AI; nodes: 1; cards: 1.
### Docker (Method 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04-rc7-das1.5-py3.10-20250612-fixpy-rocblas0611-rc2

docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

cd /your_code_path/granite-speech_pytorch
pip install "transformers>=4.53.1"
```
### Dockerfile (Method 2)
The Dockerfile can be used as follows:
```bash
cd docker
docker build --no-cache -t granite-speech:latest .
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

cd /your_code_path/granite-speech_pytorch
pip install "transformers>=4.53.1"
```
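### Anaconda (Method 3)
The special deep learning libraries this project requires for DCU cards can be downloaded and installed from the [Guanghe (光合)](https://developer.sourcefind.cn/tool/) developer community.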
```bash
DTK: 25.04
python: 3.10
vllm: 0.8.5
torch: 2.4.1+das.opt2.dtk2504
deepspeed: 0.14.2+das.opt2.dtk2504
```
`Tips: the versions of the DTK driver, Python, torch, and the other DCU-related tools listed above must correspond to one another exactly.`

Install the remaining, non-deep-learning libraries as follows:
```bash
pip install "transformers>=4.53.1"
```
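
Since the versions above have to line up, a quick check of what is actually installed can save a failed run. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `check_minimum` are hypothetical, and the parsing is deliberately simple (it compares only the leading numeric components and ignores pre-release tags and local suffixes such as `+das.opt2.dtk2504`).

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.53.1' or '2.4.1+das.opt2.dtk2504' into a tuple of leading integers."""
    public = text.split("+", 1)[0]  # drop a local suffix such as '+das.opt2.dtk2504'
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def check_minimum(package, minimum):
    """Return (ok, installed_version_or_None) for a 'package >= minimum' requirement."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False, None
    return parse_version(installed) >= parse_version(minimum), installed


if __name__ == "__main__":
    ok, found = check_minimum("transformers", "4.53.1")
    print("transformers:", found, "OK" if ok else "needs >= 4.53.1")
```

For stricter comparisons, including pre-releases, a dedicated version-parsing library would be a better fit than this hand-rolled tuple comparison.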
## Dataset
Not available yet.
## Training
Not available yet.
## Inference
### vLLM inference
```bash
## Set the following environment variables
export HF_ENDPOINT=https://hf-mirror.com
export LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torchaudio.libs:$LD_LIBRARY_PATH
## Pass the model path as an argument
python ./infer/infer_vllm.py --model-type granite_speech --model_name /your_path/granite-speech-3.3-8b
```

## Result
```
--- Prompt 1 ---
Generated Text: the first words i spoke in the original phonograph a little piece of practical poetry mary had a little lamb its fleece was white as snow and everywhere that mary went the lamb was sure to go

Logprobs per generated token:
  Step 0:
    - Generated Token: 1382 ('the')
    - Top Logprobs:
        - Rank 1: Token 1382 ('the') -> Logprob: -0.1331
        - Rank 2: Token 37711 ('these') -> Logprob: -3.5237
        - Rank 3: Token 31181 ('they') -> Logprob: -5.1253
        - Rank 4: Token 1772 ('my') -> Logprob: -5.1800
        - Rank 5: Token 292 ('he') -> Logprob: -5.4612
        - Rank 6: Token 2232 ('first') -> Logprob: -5.7268
        - Rank 7: Token 91 ('i') -> Logprob: -5.7503
        - Rank 8: Token 266 ('in') -> Logprob: -5.9378
        - Rank 9: Token 83 ('a') -> Logprob: -5.9378
        - Rank 10: Token 7020 ('here') -> Logprob: -6.0159
  Step 1:
    ...
    ...

成功将每个生成token的logprob写入到文件: ...
```
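
To compare runs programmatically, the console output above can be turned back into numbers. The sketch below parses the `Step N:` and `Rank ...` lines in the format shown; it is an illustration only, the name `parse_top_logprobs` is hypothetical, and the script itself already writes the logprobs to a file, which is likely the more convenient source when it is available.

```python
import re

# Matches lines such as: - Rank 1: Token 1382 ('the') -> Logprob: -0.1331
RANK_LINE = re.compile(
    r"Rank (?P<rank>\d+): Token (?P<token_id>\d+) \('(?P<text>.*)'\)"
    r" -> Logprob: (?P<logprob>-?\d+(?:\.\d+)?)"
)


def parse_top_logprobs(log_text):
    """Group 'Rank ...' lines under the 'Step N:' header that precedes them."""
    steps = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Step (\d+):", line)
        if header:
            current = int(header.group(1))
            steps[current] = []
            continue
        match = RANK_LINE.search(line)
        if match and current is not None:
            steps[current].append(
                (
                    int(match["rank"]),
                    int(match["token_id"]),
                    match["text"],
                    float(match["logprob"]),
                )
            )
    return steps


sample = """
  Step 0:
    - Generated Token: 1382 ('the')
    - Top Logprobs:
        - Rank 1: Token 1382 ('the') -> Logprob: -0.1331
        - Rank 2: Token 37711 ('these') -> Logprob: -3.5237
"""
parsed = parse_top_logprobs(sample)
print(parsed[0][0])
```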

### Accuracy
```
# Run infer_vllm.py on both the DCU and the GPU to collect the precision data from each
python ./infer/calc_mae.py
```
Result:
```
0.00040159359081176795
```

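The DCU and GPU outputs agree closely, with the difference reported by `calc_mae.py` at about 4.0e-4. Inference framework: vLLM.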
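
For readers unfamiliar with the metric, a mean absolute error between two runs can be computed as sketched below. This is an assumption about what a script named `calc_mae.py` does, not a copy of it; the function name is hypothetical and the two lists hold made-up values, not measurements from this project.

```python
def mean_absolute_error(reference, candidate):
    """Average of |reference[i] - candidate[i]| over two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("both runs must contain the same number of values")
    if not reference:
        raise ValueError("at least one value is required")
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


# Made-up per-token logprobs standing in for a GPU run and a DCU run.
gpu_logprobs = [-0.125, -3.5, -5.25]
dcu_logprobs = [-0.125, -3.25, -5.5]
print(mean_absolute_error(gpu_logprobs, dcu_logprobs))
```

A value close to zero means the two devices assign nearly the same log-probabilities to the generated tokens.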
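## Application scenarios
### Algorithm category
`Speech dialogue`
### Key application industries
`Finance, education, government, scientific research, manufacturing, energy, transportation`
## Pretrained weights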
- [ibm-granite/granite-speech-3.3-8b](https://hf-mirror.com/ibm-granite/granite-speech-3.3-8b)
- [ibm-granite/granite-speech-3.3-2b](https://hf-mirror.com/ibm-granite/granite-speech-3.3-2b)

## Source repository and issue feedback
- https://developer.sourcefind.cn/codes/modelzoo/granite-speech_pytorch
## References
- https://github.com/ibm-granite/granite-speech-models