# LLAMA

## Paper

`LLaMA: Open and Efficient Foundation Language Models`

- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)

## Model Architecture

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and demonstrate that state-of-the-art models can be trained exclusively on publicly available datasets, without relying on proprietary and inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The LLaMA network is based on the Transformer architecture and incorporates various improvements that were subsequently proposed and used in other models such as PaLM.

<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E6%A8%A1%E5%9E%8B%E7%BB%93%E6%9E%84.png" alt="llama模型结构.png" style="zoom:50%;" />

The main network parameter configuration of LLaMA-13B is as follows:

```
"hidden_act": "silu", 
"hidden_size": 5120, 
"intermediate_size": 13824, 
"initializer_range": 0.02, 
"max_sequence_length": 2048, 
"model_type": "llama", 
"num_attention_heads": 40, 
"num_hidden_layers": 40, 
"rms_norm_eps": 1e-06, 
"torch_dtype": "float16", 
"vocab_size": 32000
```
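
As a quick sanity check on these hyperparameters, the back-of-the-envelope estimate below (a minimal sketch assuming an untied embedding and output head and the standard LLaMA projection layout) lands close to the nominal 13B parameters:

```
# Rough parameter estimate from the config values above (illustrative only).
vocab_size, hidden, intermediate, layers = 32000, 5120, 13824, 40

embed   = vocab_size * hidden            # token embedding
lm_head = vocab_size * hidden            # output projection (assumed untied)
attn    = 4 * hidden * hidden            # q, k, v, o projections per layer
mlp     = 3 * hidden * intermediate      # gate, up, down projections per layer
norms   = 2 * hidden                     # two RMSNorm weight vectors per layer

total = embed + lm_head + layers * (attn + mlp + norms) + hidden  # + final norm
print(f"~{total / 1e9:.2f}B parameters")  # ~13.02B
```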

## Algorithm Principles

<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E7%AE%97%E6%B3%95%E5%8E%9F%E7%90%86.png" alt="llama算法原理.png" style="zoom:50%;" />

The main differences from the original Transformer architecture are as follows:

**Pre-normalization.** To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.

**SwiGLU activation function [PaLM].** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a hidden dimension of 2/3 · 4d instead of the 4d used in PaLM.

**Rotary embeddings.** Absolute positional embeddings are removed; instead, rotary positional embeddings (RoPE) are added at every layer of the network.
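
For reference, below is a minimal PyTorch sketch of the first two components (RMSNorm pre-normalization and the SwiGLU feed-forward block). It is illustrative only, not the exact implementation shipped in this repository's transformers-main:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization applied to each sub-layer input."""
    def __init__(self, hidden_size, eps=1e-6):   # rms_norm_eps is 1e-06 in the config above
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal RMS; unlike LayerNorm, no mean subtraction and no bias.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP: silu(x @ W_gate) * (x @ W_up), projected back down.
    The 2/3 * 4d rule gives intermediate_size 13824 for hidden_size 5120."""
    def __init__(self, hidden_size=5120, intermediate_size=13824):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj   = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```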

## Dataset

An English conversation dataset is bundled under the FastChat directory so users can validate quickly:

    $ tree ./FastChat-main/playground/data
    └── alpaca-data-conversation.json
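
A minimal sketch for taking a quick look at the bundled file (only the path from the tree output above is assumed; no particular record layout):

```
import json

# Path taken from the tree output above.
path = "./FastChat-main/playground/data/alpaca-data-conversation.json"
with open(path, "r", encoding="utf-8") as f:
    data = json.load(f)

print(len(data), "records")
# Show the keys of the first record to see the conversation format.
if isinstance(data, list) and data:
    print(sorted(data[0].keys()))
```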

## LLaMA-13B Fine-Tuning (with MPI)

### Environment Setup

Modify env.sh according to your node environment; the environment variables follow dtk-22.10. The setup targets 2 bare-metal Z100L nodes with 16 cards in total and requires a working dtk environment; the mpirun folder contains a pre-built OpenMPI package, mpi4.tar.gz, which can be used directly. The torch and other libraries required for the DCU cards used by this project can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community:

```
cp -r mpirun/* ./
# Update the relevant paths in env.sh for the current system
cd FastChat-main
pip3 install -e .
cd ../transformers-main
pip3 install -e .
cd ..
pip3 install torch-1.10.0a0+git2040069.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
pip3 install deepspeed-0.6.3+1b2721a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
pip3 install apex-0.1+gitdb7007a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl   # optional
```
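
After installation, a quick sanity check (assuming the wheels above installed cleanly; DCU devices of the dtk build are exposed through the torch.cuda API):

```
import torch
import deepspeed

print("torch:", torch.__version__)
print("device available:", torch.cuda.is_available())  # DCU cards appear via the torch.cuda API
print("deepspeed:", deepspeed.__version__)
```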

### Training

The training script requires 2 nodes with 8 DCU-Z100L-32G cards per node. Change the path to the model weights in mpi_single.sh as needed.

Parallelism uses ZeRO stage 3 and fine-tuning runs in fp16 precision. To enable the apex adamw_apex_fused optimizer, change the optimizer at line 55 of ./FastChat-main/fastchat/train/train.py to adamw_apex_fused (a sketch of this change follows the config below). The DeepSpeed config.json is as follows:

```
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps":16,
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": false,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients" : true
  }
}
```
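
The optimizer switch mentioned above is normally expressed through the Hugging Face `TrainingArguments` that the training entry point builds. The snippet below is a minimal sketch of that idea, not the repository's exact train.py; the file and line to edit remain ./FastChat-main/fastchat/train/train.py:55:

```
from transformers import TrainingArguments

# Hypothetical illustration of enabling the fused apex optimizer (requires the optional apex wheel).
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,   # matches train_micro_batch_size_per_gpu above
    gradient_accumulation_steps=16,  # matches the DeepSpeed config above
    fp16=True,                       # fp16 fine-tuning as described above
    deepspeed="config.json",         # path to the DeepSpeed config shown above
    optim="adamw_apex_fused",        # the change the README asks for at train.py:55
)
```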

Log in to node 1 and modify hostfile according to your environment, making sure both nodes use identical file paths and configuration. In mpi_job.sh, change enp97s0f1 in `--mca btl_tcp_if_include enp97s0f1` to the NIC name that corresponds to each node's IP (check with `ip a`); the NUMA binding can also be adjusted to match the current node topology. Fine-tuning command:

```
source mpi_job.sh
```

If the 7B model runs out of memory (OOM) when run on a single node, reduce the batch size accordingly.

## Accuracy

Training data: `./FastChat-main/playground/data/alpaca-data-conversation.json`

GPGPUs used: 16 × DCU-Z100L-32G.

Model accuracy (max_sequence_length: 2048):

| Cards | Distributed framework | Convergence |
| :------: | :------: | :------: |
| 16 | deepspeed | total_loss: 0.62 at 150 steps |
## Application Scenarios

### Algorithm Category

`Natural Language Processing`

### Key Application Industries

`NLP, intelligent chat assistants, scientific research`

## Source Repository and Issue Feedback

- https://developer.hpccube.com/codes/modelzoo/llama_torch

## References

* https://huggingface.co/decapoda-research/llama-13b-hf
* https://github.com/lm-sys/FastChat