"git@developer.sourcefind.cn:OpenDAS/torchaudio.git" did not exist on "fc6090e96b37a528b051aa6747e6233354b5a3ef"
# TinyLlama
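TinyLlama has only 1.1B parameters. It scales down both the model size and the amount of training data relative to Llama 2, and it works as a drop-in replacement in many open-source projects built on Llama. The steps below cover finetuning and inference with the finetuned model.
## Paper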
`Llama 2: Open Foundation and Fine-Tuned Chat Models`
- https://arxiv.org/pdf/2307.09288.pdf

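## Model Architecture
Llama 2 is based on the original transformer decoder architecture. In the input stage, Llama 2 tokenizes the text and converts each token into an embedding vector; TinyLlama uses the same architecture and tokenizer as Llama 2. In the feature-extraction stage, Llama 2 extracts features through a stack of attention layers and fully connected FeedForward layers. Finally, in the output stage, Llama 2 uses a fully connected layer (MLP) to reshape the tensor and produce the generation result, and a strategy such as greedy search selects the token with the highest current probability as the output. To further improve prediction accuracy, reinforcement learning from human feedback (RLHF) is added as supervision. After extensive experiments, the paper's authors conclude: (data) quality is all you need!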

<div align=center>
    <img src="./doc/bockbone.png"/>
</div>
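
The greedy selection step described above can be sketched in a few lines of plain Python. This is a toy illustration, not code from this repository: `next_logits` and `toy_model` are hypothetical stand-ins for the real model, and the vocabulary has only four tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(next_logits, prompt_ids, eos_id, max_new_tokens=16):
    """Repeatedly append the single most probable next token.

    next_logits is a hypothetical callable standing in for the model:
    it maps the token ids so far to one score per vocabulary entry.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(ids))
        best = max(range(len(probs)), key=probs.__getitem__)
        ids.append(best)
        if best == eos_id:
            break
    return ids

# Toy "model" over a 4-token vocabulary: always prefers (last id + 1) % 4.
def toy_model(ids):
    scores = [0.0] * 4
    scores[(ids[-1] + 1) % 4] = 5.0
    return scores

print(greedy_decode(toy_model, prompt_ids=[0], eos_id=3))  # prints [0, 1, 2, 3]
```

Greedy search is the simplest decoding strategy; sampling or beam search would replace only the line that picks `best`.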

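## Algorithm
The Llama 2 algorithm takes tokens that have been converted to vectors, extracts features with QKV self-attention and fully connected layers, then uses a fully connected layer to output the result for supervised training and a search algorithm to pick the target tokens. For the details, see the decoder on the right-hand side of the original transformer architecture in the figure below. Compared with the original transformer, Llama 2 adopts three changes that reduce computation and improve accuracy: RMSNorm, SwiGLU, and RoPE.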
<div align=center>
    <img src="./doc/transformer.png"/>
</div>
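
To make the three components named above concrete, here is a minimal pure-Python sketch of each on plain lists. It is an illustration under stated assumptions, not the implementation used in this repository: the learned projection matrices of the feed-forward layer are omitted, RoPE rotates adjacent pairs of dimensions (some implementations pair dimension `i` with `i + d/2` instead), and `base=10000.0` is the commonly used default rather than a value taken from this README.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) multiplied elementwise with up."""
    return [silu(g) * u for g, u in zip(gate, up)]

def rope(x, position, base=10000.0):
    """RoPE: rotate each adjacent pair (x[i], x[i+1]) by position * base**(-i/d)."""
    d = len(x)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)                        # mean square of the output is close to 1
print(swiglu([0.0, 2.0], [5.0, 1.0]))
print(rope([1.0, 0.0, 1.0, 0.0], position=3))
```

RMSNorm skips the mean subtraction of LayerNorm, and RoPE only rotates vectors, so it preserves their length while encoding position in the angle.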

## Environment Setup
```
mv tinyllama_pytorch TinyLlama # drop the framework-name suffix
```

### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is ffa1f63239fc
docker run -it --shm-size=32G -v $PWD/TinyLlama:/home/TinyLlama -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name tinyllama <your IMAGE ID> bash
cd /home/TinyLlama
pip install -r requirements.txt
```
### Dockerfile (Option 2)
```
cd TinyLlama/docker
docker build --no-cache -t tinyllama:latest .
docker run --shm-size=32G --name tinyllama -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../TinyLlama:/home/TinyLlama -it tinyllama bash
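# If building the environment through the Dockerfile takes too long, comment out the pip install inside it and install the Python libraries after the container starts: pip install -r requirements.txt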
```
### Anaconda (Option 3)
1. The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the 光合 (Guanghe) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk24.04.1
python: python3.10
torch: 2.1.0
torchvision: 0.16.0
triton: 2.1.0
apex: 1.1.0
flash_attn: 2.0.4
xformers: 0.0.25
rotary-emb: 0.1
dropout-layer-norm: 0.1
xentropy-cuda-lib: 0.1
```

`Tips: the versions of the DCU-related tools above (DTK driver, python, torch, and so on) must match each other exactly.`
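
Since the versions have to line up, a quick check of the installed packages can save a failed run. The sketch below uses only the standard-library `importlib.metadata`. The distribution names are assumed to match the names in the list above, and the comparison is by prefix on the assumption that DCU builds may append a local suffix to the version string; adjust both if your environment differs.

```python
from importlib import metadata

# Versions copied from the list above. Only the leading part of the installed
# version string is compared (assumption: builds may carry a local suffix).
EXPECTED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "triton": "2.1.0",
    "apex": "1.1.0",
    "flash_attn": "2.0.4",
    "xformers": "0.0.25",
}

def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def check_versions(expected=EXPECTED, lookup=installed_version):
    """Map each package name to (expected, found, ok)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        ok = found is not None and found.startswith(want)
        report[name] = (want, found, ok)
    return report

for name, (want, found, ok) in check_versions().items():
    print(f"{'OK ' if ok else 'BAD'} {name}: expected {want}, found {found}")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed.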

2. Install the remaining ordinary libraries from requirements.txt:
```
pip install -r requirements.txt
```

If finetuning fails with a bitsandbytes call error, upgrading the system's libstdc++ resolves it:
```
wget http://www.vuln.cn/wp-content/uploads/2019/08/libstdc.so_.6.0.26.zip
unzip libstdc.so_.6.0.26.zip
cp libstdc++.so.6.0.26 /usr/lib64
rm -rf /lib64/libstdc++.so.6
ln -s /lib64/libstdc++.so.6.0.26 /lib64/libstdc++.so.6
```


## Dataset
`openassistant-guanaco`

[huggingface](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/tree/main)

The project already ships a mini dataset for finetuning. Its directory layout is:
```
timdettmers/
├── openassistant_best_replies_train.jsonl
└── openassistant_best_replies_eval.jsonl
```
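
A quick way to sanity-check the two files is to load them and look at the record count and field names. The sketch below assumes only that each file is JSON Lines (one JSON object per line); it does not assume any particular field names. The directory is the one shown in the tree above; note that the accuracy section links the eval file with an extra `openassistant-guanaco` path component, so adjust `data_dir` to wherever the files actually live.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

data_dir = Path("timdettmers")  # directory name from the tree above
for split in ("train", "eval"):
    path = data_dir / f"openassistant_best_replies_{split}.jsonl"
    if path.exists():
        records = load_jsonl(path)
        keys = sorted(records[0]) if records else []
        print(f"{split}: {len(records)} records, fields: {keys}")
    else:
        print(f"{split}: {path} not found")
```
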
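The datasets that the upstream project uses for training from scratch are listed below. For preprocessing the full datasets, see [`PRETRAIN.md`](./PRETRAIN.md).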

`SlimPajama-627B`

[huggingface](https://huggingface.co/datasets/cerebras/SlimPajama-627B)

`starcoderdata`

[huggingface](https://huggingface.co/datasets/bigcode/starcoderdata)

`For more information, see README_origin.md from the upstream project.`
## Training
### Single node, multiple cards (finetune)
```
# For the pretrained weights needed for finetuning, see the pretrained-weights section near the end of this README
# This step uses the 503b pretrained weights; download them and place them under the PY007 directory: PY007/TinyLlama-1.1B-intermediate-step-240k-503b
cd TinyLlama
sh sft/script.sh # full-parameter finetune
# While training starts up, answer the prompt "wandb: Enter your choice:" with 3
```
To train from scratch, use the training commands in [`PRETRAIN.md`](./PRETRAIN.md).

## Inference
```
python sft/infer.py
# To run inference with the official default weights, set model="PY007/TinyLlama-1.1B-intermediate-step-240k-503b" in the code
```
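
For readers who want to see roughly what such an inference script involves, here is a minimal sketch using the Hugging Face `transformers` API. It is not the content of `sft/infer.py`: the prompt template in `build_prompt` is a hypothetical one modelled on the guanaco data format, the model runs on the default device, and `max_new_tokens=128` is an arbitrary choice.

```python
MODEL = "PY007/TinyLlama-1.1B-intermediate-step-240k-503b"  # path named in this README

def build_prompt(question):
    """Hypothetical prompt template; check sft/infer.py for the one actually used."""
    return f"### Human: {question}### Assistant:"

def generate_reply(question, model_path=MODEL, max_new_tokens=128):
    """Load the checkpoint with Hugging Face transformers and decode greedily."""
    # Imported lazily so that build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_reply("What is a transformer?")` would load the weights from `model_path` and return the decoded text, prompt included.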

## Result
```
# Question
Human: Do you support the Biden or Sanders campaign for President? 
# Generated answer
Assistant: Well, I really don't want him to be president because of his positions on so many issues. But I do agree with Sanders that the US needs a change. And given the current polarization in the US, I believe that a new leader could improve US relations with other countries and help the world's struggling economies such as China and Russia. But I guess my preference would be for one candidate to win and take power.
```
### Accuracy
Test data: [`openassistant-guanaco`](./timdettmers/openassistant-guanaco/openassistant_best_replies_eval.jsonl); inference framework: pytorch.

|  device   |  train_loss  | eval_loss |
|:---------:|:----:|:----:|
| DCU Z100SM | 1.7787 | 1.8038 |
| GPU V100S  | 1.7787 | 1.8036 |
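
The two devices report identical train_loss, and the size of the eval_loss gap is easy to quantify. A short calculation with the values from the table:

```python
# eval_loss values copied from the table above
dcu_eval, gpu_eval = 1.8038, 1.8036

abs_gap = abs(dcu_eval - gpu_eval)
rel_gap = abs_gap / gpu_eval

print(f"absolute gap: {abs_gap:.4f}")  # absolute gap: 0.0002
print(f"relative gap: {rel_gap:.4%}")  # relative gap: 0.0111%
```

A relative difference of about 0.011% puts the DCU and GPU results in close agreement.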

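## Application Scenarios
### Algorithm Category
`Dialogue question answering`
### Key Application Industries
`Manufacturing, broadcast media, finance, energy, healthcare, home furnishing, education`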

## Pretrained Weights

[huggingface](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-240k-503b) 
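
One possible way to fetch the checkpoint into the directory that the finetune step expects is the `huggingface_hub` package, sketched below. This assumes `huggingface_hub` is installed; the `use_mirror` flag is a hypothetical convenience that points the client at the hf-mirror.com endpoint through the `HF_ENDPOINT` environment variable, which has to be set before `huggingface_hub` is first imported.

```python
import os
from pathlib import Path

REPO_ID = "TinyLlama/TinyLlama-1.1B-intermediate-step-240k-503b"  # from the link above

def weights_dir(project_root="."):
    """Directory the finetune step expects the weights in, per this README."""
    return Path(project_root) / "PY007" / "TinyLlama-1.1B-intermediate-step-240k-503b"

def download_weights(project_root=".", use_mirror=False):
    """Fetch the checkpoint with huggingface_hub (assumed to be installed)."""
    if use_mirror:
        # Must be set before huggingface_hub is first imported in this process.
        os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
    from huggingface_hub import snapshot_download

    target = weights_dir(project_root)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Running `download_weights("TinyLlama")` from the parent directory would place the files under `TinyLlama/PY007/TinyLlama-1.1B-intermediate-step-240k-503b`.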

## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/tinyllama_pytorch
## References
- https://github.com/jzhang38/TinyLlama.git
- https://hf-mirror.com/ # Hugging Face mirror site, with download instructions
- https://hf-mirror.com/datasets # Hugging Face mirror, datasets