# LLaMA

## Paper

`LLaMA: Open and Efficient Foundation Language Models`

- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)

## Model Architecture

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens, demonstrating that state-of-the-art models can be trained using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The LLaMA network is based on the Transformer architecture and incorporates various improvements that were subsequently proposed and used in different models such as PaLM.

<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E6%A8%A1%E5%9E%8B%E7%BB%93%E6%9E%84.png" alt="llama模型结构.png" style="zoom:50%;" />

The main network parameter configuration of llama-13B:

```json
{
  "hidden_act": "silu",
  "hidden_size": 5120,
  "intermediate_size": 13824,
  "initializer_range": 0.02,
  "max_sequence_length": 2048,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "rms_norm_eps": 1e-06,
  "torch_dtype": "float16",
  "vocab_size": 32000
}
```
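
As a sanity check on the configuration above, the parameter count it implies can be estimated in a few lines. This is a rough back-of-the-envelope sketch that ignores norm weights and assumes the standard LLaMA layout (4 attention projections, 3 feed-forward projections, untied embeddings):

```python
# Rough parameter count implied by the llama-13B configuration above
# (ignores norm weights; an estimate, not an exact count).
d, ffn_dim, n_layers, vocab = 5120, 13824, 40, 32000

attn_params = 4 * d * d        # q, k, v and output projections
ffn_params = 3 * d * ffn_dim   # gate, up and down projections
layer_params = attn_params + ffn_params

# input embeddings + lm_head + transformer layers
total = vocab * d * 2 + n_layers * layer_params
print(f"{total / 1e9:.1f}B parameters")  # ≈ 13.0B
```

The estimate lands at roughly 13.0B, consistent with the model's name.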

## Algorithm Principles

<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E7%AE%97%E6%B3%95%E5%8E%9F%E7%90%86.png" alt="llama算法原理.png" style="zoom:50%;" />

The main differences from the original Transformer architecture are:

**Pre-normalization.** To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
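
A minimal pure-Python sketch of RMSNorm over a single vector (real implementations operate on batched tensors; the `eps` default mirrors `rms_norm_eps` from the config above):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Scale by the root mean square of the activations; unlike LayerNorm,
    # no mean is subtracted and no bias is added.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```

After normalization (with unit weights) the output vector has RMS ≈ 1, which is what stabilizes the sub-layer inputs.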

**SwiGLU activation function [PaLM].** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a hidden dimension of 2/3 · 4d instead of the 4d used in PaLM.
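
A scalar sketch of the SwiGLU gate, plus the 2/3 · 4d hidden-size rule. The rounding to a multiple of 256 follows the reference LLaMA implementation; for d = 5120 it reproduces the `intermediate_size` of 13824 from the config above:

```python
import math

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    # SwiGLU gate, shown element-wise: SiLU(W_gate x) * (W_up x)
    return silu(w_gate * x) * (w_up * x)

def llama_ffn_dim(d_model, multiple_of=256):
    # 2/3 * 4d, rounded up to a multiple of 256
    hidden = int(2 * 4 * d_model / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(llama_ffn_dim(5120))  # 13824, matching intermediate_size above
```

The same rule gives 11008 for d = 4096, the hidden size of the 7B model.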

**Rotary embeddings.** Absolute position embeddings are removed; instead, rotary position embeddings (RoPE) are added at every layer of the network.
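
A minimal sketch of RoPE applied to one head vector, in pure Python (real implementations vectorize this; the base of 10000 follows the RoPE paper). The defining property is that the dot product of rotated queries and keys depends only on the relative position:

```python
import math

def rope_rotate(x, pos, base=10000.0):
    # Rotate consecutive pairs (x[i], x[i+1]) by an angle that grows
    # with the token position and shrinks with the pair index.
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

At position 0 the rotation is the identity, and q·k for positions (5, 9) equals q·k for positions (2, 6), since both pairs have the same offset of 4.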

## Dataset

An English dialogue dataset is bundled in the FastChat directory so users can verify training quickly:

    $ tree ./FastChat-main/playground/data
    └── alpaca-data-conversation.json
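
The file holds multi-turn conversations. A small sketch of reading such data (the record below is a hypothetical example; the field names follow the conversation format FastChat expects, and are worth verifying against the actual file):

```python
import json

# Hypothetical record mirroring the conversation format (field names assumed)
sample = '''[
  {"id": "identity_0",
   "conversations": [
     {"from": "human", "value": "Give three tips for staying healthy."},
     {"from": "gpt", "value": "1. Eat a balanced diet. 2. Exercise. 3. Sleep well."}
   ]}
]'''

records = json.loads(sample)  # for the real file: json.load(open(path))
for rec in records:
    for turn in rec["conversations"]:
        print(f'{turn["from"]}: {turn["value"]}')
```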

## Environment Setup

Because multi-node environments vary widely, adapt env.sh to your nodes. The reference environment is dtk-22.10 with a working Python 3.8 setup and functioning network interfaces. Training uses 2 bare-metal Z00L nodes with 8 cards each, and requires a working dtk environment. The mpirun folder contains a precompiled OpenMPI library, mpi4.tar.gz, which can be used directly. The torch and other libraries this project requires for DCU cards can be downloaded from the [光合](https://developer.hpccube.com/tool/) developer community:

```shell
cp -r mpirun/* ./
# edit the relevant paths in env.sh for your current system
cd FastChat-main
pip3 install -e .
cd ../transformers-main
pip3 install -e .
cd ..
pip3 install torch-1.10.0a0+git2040069.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
pip3 install deepspeed-0.6.3+1b2721a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
pip3 install apex-0.1+gitdb7007a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl  # optional
pip3 uninstall wandb
```

## Training

Pretrained weight links:

13B:[decapoda-research/llama-13b-hf · Hugging Face](https://huggingface.co/srikanthmalla/decapoda-research-llama-13b-hf)

7B:[decapoda-research/llama-7b-hf · Hugging Face](https://huggingface.co/srikanthmalla/decapoda-research-llama-7b-hf)

The training script requires 2 nodes with 8 DCU-Z100L-32G cards each. Change the model weight path in mpi_single.sh as needed.

Parallelism uses ZeRO stage 3, with fp16 fine-tuning. To enable the apex adamw_apex_fused optimizer, change the optimizer to adamw_apex_fused at line 55 of ./FastChat-main/fastchat/train/train.py. The DeepSpeed config.json is as follows:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps":16,
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": false,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients" : true
  }
}
```
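
With these settings, the effective global batch size on the 2-node, 16-card setup works out as follows (assuming no batch-size override elsewhere in the training script):

```python
micro_batch_per_gpu = 4   # train_micro_batch_size_per_gpu
grad_accum_steps = 16     # gradient_accumulation_steps
num_gpus = 16             # 2 nodes x 8 DCUs each
global_batch = micro_batch_per_gpu * grad_accum_steps * num_gpus
print(global_batch)  # 1024 samples per optimizer step
```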

On node 1, modify hostfile for your environment and make sure both nodes have identical file paths and configuration. In mpi_job.sh, change enp97s0f1 in `--mca btl_tcp_if_include enp97s0f1` to the NIC name that corresponds to each node's IP (check with `ip a`). NUMA binding can be adjusted to match the node's topology. Fine-tuning command:

```
bash mpi_job.sh
```

If the 7B model runs out of memory on a single node, reduce the batch size accordingly.

## Result
### Input

```plaintext
>>>冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者
```

### Output

```plaintext
>>>回答:避寒,当然是去海南呀!海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾,没有雾霾!
```

### Accuracy

Training data: `./FastChat-main/playground/data/alpaca-data-conversation.json`

GPGPUs used: 16× DCU-Z100L-32G.

Model accuracy (max_sequence_length: 2048):

| Cards | Distributed tool | Convergence |
| :------: | :------: | :------: |
| 16 | deepspeed | total_loss: 0.62 / 150 steps |

## Application Scenarios

### Algorithm Category

`Dialogue Q&A`

### Key Application Industries

`Healthcare, Education, Scientific Research, Finance`

## Source Repository and Issue Feedback

- https://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch

## References

* https://huggingface.co/decapoda-research/llama-13b-hf
* https://github.com/lm-sys/FastChat