# LLaMA
## Paper

`LLaMA: Open and Efficient Foundation Language Models`

- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)

## Model Architecture

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that it is possible to train state-of-the-art models exclusively on publicly available datasets, without resorting to proprietary and inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The LLaMA network is based on the Transformer architecture and incorporates various improvements that were subsequently proposed and used in other models such as PaLM.
<img src="http://developer.sourcefind.cn/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E6%A8%A1%E5%9E%8B%E7%BB%93%E6%9E%84.png" alt="llama模型结构.png" style="zoom:50%;" />

The main network configuration of llama-13B is as follows:

```
"hidden_act": "silu", 
"hidden_size": 5120, 
"intermediate_size": 13824, 
"initializer_range": 0.02, 
"max_sequence_length": 2048, 
"model_type": "llama", 
"num_attention_heads": 40, 
"num_hidden_layers": 40, 
"rms_norm_eps": 1e-06, 
"torch_dtype": "float16", 
"vocab_size": 32000
```
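
As a sanity check, these hyperparameters can be used to estimate the total parameter count. The sketch below is illustrative and not part of the repository; it assumes the standard LLaMA layout of four attention projections and three SwiGLU MLP projections per layer, with an untied output head:

```python
# Rough LLaMA-13B parameter count derived from the config above (illustrative sketch).
hidden_size = 5120
intermediate_size = 13824
num_hidden_layers = 40
vocab_size = 32000

embed = vocab_size * hidden_size             # token embedding
lm_head = vocab_size * hidden_size           # output projection (untied)
attn = 4 * hidden_size * hidden_size         # q, k, v, o projections
mlp = 3 * hidden_size * intermediate_size    # gate, up, down (SwiGLU)
norms = 2 * hidden_size                      # two RMSNorms per layer

total = embed + lm_head + num_hidden_layers * (attn + mlp + norms) + hidden_size
print(f"{total / 1e9:.1f}B parameters")      # ≈ 13.0B
```

The result lands at roughly 13 billion parameters, consistent with the model name.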

## Algorithm Principles

<img src="http://developer.sourcefind.cn/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E7%AE%97%E6%B3%95%E5%8E%9F%E7%90%86.png" alt="llama算法原理.png" style="zoom:50%;" />
The main differences from the original Transformer architecture are:

**Pre-normalization.** To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.

**SwiGLU activation function [PaLM].** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a hidden dimension of 2/3 · 4d instead of the 4d used in PaLM.

**Rotary embeddings.** Absolute positional embeddings are removed; instead, rotary position embeddings (RoPE) are added at every layer of the network.
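
The first two of these components can be sketched in a few lines of plain Python. This is an illustrative re-implementation on Python lists, not the repository's code; real implementations operate on batched tensors:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale by the reciprocal root-mean-square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU: elementwise product of a SiLU-gated branch and a linear branch."""
    return [silu(g) * u for g, u in zip(gate, up)]

x = [1.0, -2.0, 3.0]
print(rms_norm(x, [1.0, 1.0, 1.0]))  # output has unit root-mean-square
```

In the actual MLP block, `gate` and `up` are two separate linear projections of the same input, and the SwiGLU result is projected back down by a third linear layer.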

## Dataset

An English conversation dataset is bundled in the FastChat directory so that users can verify the setup quickly:

    $ tree ./FastChat-main/playground/data
    └── alpaca-data-conversation.json
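
For reference, FastChat's Alpaca-style conversation files are JSON lists of records, each with an `id` and a `conversations` list of alternating turns. The record below is a hypothetical illustration of that shape, not an entry from the bundled file:

```python
import json

# Hypothetical record illustrating the FastChat conversation schema.
sample = [
    {
        "id": "example_0",
        "conversations": [
            {"from": "human", "value": "Give three tips for staying healthy."},
            {"from": "gpt", "value": "1. Eat a balanced diet. 2. Exercise. 3. Sleep well."},
        ],
    }
]
print(json.dumps(sample, indent=2))
```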
## Environment Setup
### Docker (Method 1)
```
# Pull the image:
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
# Create and start the container:
docker run --shm-size 64g --network=host --name=llama_fastchat --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined  -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> -it <Your Image ID> bash

cp -r mpirun/* ./
cd FastChat-main
pip3 install -e .
cd ../transformers-main
pip3 install -e .
pip3 uninstall wandb
pip3 install mpi4py
cd ..
```

### Dockerfile (Method 2)
```
cd llama_fastchat_pytorch
docker build --no-cache -t llama_fastchat:latest .
docker run --shm-size 64g --network=host --name=llama_fastchat --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined  -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> -it llama_fastchat:latest bash

cp -r mpirun/* ./
cd FastChat-main
pip3 install -e .
cd ../transformers-main
pip3 install -e .
pip3 uninstall wandb
pip3 install mpi4py
cd ..
```

### Anaconda (Method 3)

The environment is based on dtk-24.04.1 and Python 3.10, and requires a working DTK installation. The torch and other libraries that this project requires for DCU GPUs can be downloaded from the [光合](https://developer.sourcefind.cn/tool/) developer community:

1. The DCU-specific deep learning libraries required by this project can be downloaded from the 光合 developer community:
https://developer.sourcefind.cn/tool/

```
DTK driver: dtk24.04.1
python:python3.10
torch:2.1.0
torchvision:0.16.0
apex:1.1
```

`Tips: the versions of the DTK, python, torch, and other DCU tool packages above must match each other exactly.`
2. Install the other, non-DCU-specific libraries:
```
cp -r mpirun/* ./
cd FastChat-main
pip3 install -e .
cd ../transformers-main
pip3 install -e .
cd ..
pip3 uninstall wandb
```

## Training

Weight links:

13B: [llama-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf)
7B: [llama-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
Modify the model weight path in mpi_single.sh as needed.

The parallel configuration uses DeepSpeed ZeRO stage 3, and fine-tuning runs in fp16 precision. To enable the apex fused AdamW optimizer, change the optimizer at line 55 of ./FastChat-main/fastchat/train/train.py to `adamw_apex_fused`. The DeepSpeed config.json is as follows:

```
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps":16,
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": false,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients" : true
  }
}
```
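
Under this config, the effective global batch size is the per-GPU micro-batch times the gradient accumulation steps times the number of data-parallel ranks. Taking the 16-card setup from the accuracy table below, that works out to:

```python
# Effective global batch size implied by the DeepSpeed config above.
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 16
num_gpus = 16  # 16x DCU-Z100L-32G, as in the accuracy table

global_batch = train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
print(global_batch)  # 1024
```

Reducing `train_micro_batch_size_per_gpu` (as suggested below for OOM) shrinks this global batch proportionally unless `gradient_accumulation_steps` is raised to compensate.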
<!--This training script requires 2 nodes with 8 DCU-Z100L-32G cards each.
On node 1, edit hostfile for your environment and make sure both nodes have identical file paths and configuration. In mpi_job.sh, change `--mca btl_tcp_if_include enp97s0f1` so that enp97s0f1 is replaced by the NIC name that `ip a` shows for the node's IP, and adjust the numa binding according to the node topology. Fine-tuning command:-->
Run command:
```
# Comment out "source env.sh" in mpi_single.sh and edit hostfile to match your environment
mpirun -np 8 --allow-run-as-root  --hostfile hostfile --bind-to none  mpi_single.sh 8
```

If the 7B model runs out of memory (OOM) on a single node, reduce the batch size appropriately.

## Results
### Input

```plaintext
>>>冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者
```

### Output

```plaintext
>>>回答:避寒,当然是去海南呀!海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾,没有雾霾!
```

### Accuracy

Training data: `./FastChat-main/playground/data/alpaca-data-conversation.json`

GPGPUs used: 16× DCU-Z100L-32G.

Model accuracy (max_sequence_length: 2048):
| Cards | Distributed framework | Convergence |
| :------: | :------: | :------: |
| 16 | deepspeed | total_loss: 0.62 / 150 steps |

## Application Scenarios

### Algorithm Category

`Conversational Q&A`

### Key Application Industries

`Healthcare, Education, Research, Finance`
## Pretrained Weights


## Source Repository and Issue Feedback

- https://developer.sourcefind.cn/codes/modelzoo/llama_fastchat_pytorch
## References
* https://hf-mirror.com/yahma/llama-7b-hf/tree/main
* https://github.com/lm-sys/FastChat