"examples/legacy/token-classification/run_tf_ner.py" did not exist on "dd9d483d03962fea127f59661f3ae6156e7a91d2"
README.md 10.3 KB
Newer Older
yuguo-Jack's avatar
yuguo-Jack committed
1
# LLaMA

## Paper

`LLaMA: Open and Efficient Foundation Language Models`

- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)

## Model Architecture

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained exclusively on publicly available datasets, without relying on proprietary or inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The LLaMA network is based on the Transformer architecture and incorporates various improvements that were proposed for and used in other models, such as PaLM.

<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E6%A8%A1%E5%9E%8B%E7%BB%93%E6%9E%84.png" alt="llama模型结构.png" style="zoom:50%;" />

The main network configuration of llama-13B is as follows:

```
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "rms_norm_eps": 1e-06,
  "use_recompute": false,
  "vocab_size": 32000
}
```
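
As a quick check, the same values can be read back programmatically. The sketch below assumes PaddleNLP's `AutoConfig` API and that the `facebook/llama-13b` community weights are reachable (or already cached locally):

```
# Minimal sketch: read back the llama-13b configuration through PaddleNLP.
# Assumes paddlenlp is installed and "facebook/llama-13b" can be downloaded or is cached.
from paddlenlp.transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/llama-13b")
print(config.hidden_size)          # expected: 5120
print(config.num_hidden_layers)    # expected: 40
print(config.num_attention_heads)  # expected: 40
print(config.vocab_size)           # expected: 32000
```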

## Algorithm

<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E7%AE%97%E6%B3%95%E5%8E%9F%E7%90%86.png" alt="llama算法原理.png" style="zoom:50%;" />

The main differences from the original Transformer architecture are as follows:

**Pre-normalization.** To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
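
A minimal NumPy sketch of RMSNorm, for illustration only (it mirrors the `rms_norm_eps: 1e-06` value from the config above, not the project's actual implementation):

```
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root mean square over the last dimension, then apply a learned gain.
    # Unlike LayerNorm there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.random.randn(2, 8, 5120).astype("float32")  # (batch, seq_len, hidden_size)
gain = np.ones(5120, dtype="float32")              # learned weight, initialized to ones
y = rms_norm(x, gain)                              # same shape as x
```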

**SwiGLU activation function [PaLM].** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a dimension of 2/3 · 4d instead of the 4d used in PaLM.
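
With hidden size d = 5120, 2/3 · 4d ≈ 13653, which rounded up to a hardware-friendly multiple gives the `intermediate_size: 13824` seen in the config above. A small NumPy sketch of the SwiGLU feed-forward block (illustrative only, with tiny dimensions to keep it light):

```
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU (swish) activation

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: down( SiLU(x @ w_gate) * (x @ w_up) )
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d, d_ff = 64, 172                         # llama-13b uses d=5120, d_ff=13824 (~2/3 * 4d)
x = np.random.randn(4, d).astype("float32")
w_gate = 0.02 * np.random.randn(d, d_ff).astype("float32")
w_up = 0.02 * np.random.randn(d, d_ff).astype("float32")
w_down = 0.02 * np.random.randn(d_ff, d).astype("float32")
y = swiglu_ffn(x, w_gate, w_up, w_down)   # shape (4, d)
```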

**Rotary embeddings.** Absolute position embeddings are removed; instead, rotary position embeddings (RoPE) are added at every layer of the network.
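
A NumPy sketch of rotary position embeddings applied to a single attention head (illustrative only; it uses the common split-half rotation convention, which may differ from the exact layout in a given implementation):

```
import numpy as np

def rotary_embedding(x, base=10000.0):
    # x: (seq_len, head_dim). Rotate each channel pair by an angle that grows with the
    # token position and decays with the channel index, so relative offsets become
    # phase differences between rotated queries and keys.
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)      # theta_i = base^(-2i/head_dim)
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 128).astype("float32")     # 16 positions, head_dim = 5120 / 40 = 128
q_rot = rotary_embedding(q)                        # same shape, position information mixed in
```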

## Datasets

### Continued Pre-training

The detailed data preparation workflow is documented [here](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md); for example, see [here](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) for preparing the OpenWebText2 pre-training data.

To make it easy to run and test this model, the project provides a pre-processed training sample of 100k documents:

    cd ./llm/llama/
    mkdir data && cd data
    wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_ids.npy
    wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_idx.npz
    cd .. && tree data 
    data
    ├── llama_openwebtext_100k_ids.npy
    └── llama_openwebtext_100k_idx.npz
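
To verify the download, the two files can be inspected with NumPy. A small sketch; the exact internal layout is an assumption, so it only prints shapes and stored array names:

```
import numpy as np

ids = np.load("data/llama_openwebtext_100k_ids.npy")   # tokenized corpus (layout assumed)
idx = np.load("data/llama_openwebtext_100k_idx.npz")   # document index arrays (layout assumed)
print(ids.dtype, ids.shape)                            # total number of stored token ids
print(idx.files)                                       # names of the arrays inside the .npz
```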

### SFT

```
cd ./llm/
wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz
tar -zxvf AdvertiseGen.tar.gz
```
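
After extraction, a quick peek at the first record confirms the field layout expected by the fine-tuning script. A sketch only; it assumes the data is JSON-lines and that the extracted files sit under `./data/` (matching `dataset_name_or_path` in the SFT configuration later in this README), and it prints the field names rather than assuming them:

```
import json

# Peek at the first training record of the extracted SFT data (path assumed).
with open("./data/train.json", encoding="utf-8") as f:
    first = json.loads(f.readline())
print(list(first.keys()))  # field names used by the SFT data
print(first)
```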

## Environment Setup

### Docker

Running in Docker is recommended, and a pullable Docker image is provided. The newer DTK version required by this project can be downloaded and installed from the [Guanghe](https://developer.hpccube.com/tool/) developer community; the Docker image uses dtk-23.04.1 by default:

```
docker pull registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73

docker run -it --network=host --name=paddle_llama --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v `pwd`:/home registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73 /bin/bash

# Replace with DTK-23.10

pip install paddlenlp==2.6.1 -i http://mirrors.aliyun.com/pypi/simple/ 
wget http://10.6.10.68:8000/customized/paddle/llama/paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl
pip3 install paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl
pip3 install tool_helpers visualdl==2.5.3 -i http://mirrors.aliyun.com/pypi/simple/ 
```
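
After installation, the environment can be verified with a short snippet; `paddle.utils.run_check()` is Paddle's standard self-test, and the printed versions are only expectations for the wheels installed above:

```
import paddle
import paddlenlp

print(paddle.__version__)      # expected 2.5.1 (the DTK wheel installed above)
print(paddlenlp.__version__)   # expected 2.6.1
paddle.utils.run_check()       # runs a small program to confirm the device backend works
```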

## Training

### Continued Pre-training

Model weights:

13B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b)

7B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b)

The training script requires 1 node, with 8 DCU-Z100L-32G cards per node.

The parallel configuration uses TP 8 and PP 1, with fp16 precision for pre-training. The configuration is as follows:

```
--model_type "llama" \
--model_name_or_path "facebook/llama-13b" \
--tokenizer_name_or_path "facebook/llama-13b" \
--input_dir "./data" \
--output_dir "output/$task_name" \
--split 949,50,1 \
--max_seq_length 2048 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--per_device_eval_batch_size 2 \
--use_flash_attention 0 \
--use_fused_rms_norm 0 \
--fp16  \
--fp16_opt_level "O2"  \
--scale_loss 512 \
--tensor_parallel_degree 8 \
--learning_rate 0.00001 \
--min_learning_rate 0.000001 \
--max_steps 10000 \
--save_steps 5000 \
--weight_decay 0.01 \
--warmup_ratio 0.01 \
--max_grad_norm 1.0 \
--logging_steps 10 \
--dataloader_num_workers 1 \
--eval_steps 1000 \
--report_to "visualdl" \
--sharding "stage1" \
--disable_tqdm true \
--continue_training 1 \
--recompute 1 \
--recompute_granularity full \
--do_train \
--do_eval \
--device "gpu" \
--distributed_dataloader 1
```
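
With 8 cards under TP 8 and PP 1, all cards hold shards of a single model replica, so the data-parallel degree is 1 and the effective global batch size follows from the micro-batch size and gradient accumulation alone. A quick sanity check of that arithmetic (not part of the training script):

```
# Effective global batch size implied by the flags above (sanity-check arithmetic only).
num_cards = 8
tensor_parallel_degree = 8
pipeline_parallel_degree = 1
per_device_train_batch_size = 1
gradient_accumulation_steps = 2

data_parallel_degree = num_cards // (tensor_parallel_degree * pipeline_parallel_degree)
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * data_parallel_degree
print(data_parallel_degree, global_batch_size)  # 1 replica, 2 sequences of 2048 tokens per step
```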

Continued pre-training command:

```
cd ./llm/llama/
bash run_trainer_tp8.sh
```

Notes:

1. `continue_training` means training is loaded from an existing pre-trained model. The initial loss of the 7B/13B models is around 1.9x, while the loss of a randomly initialized model starts decreasing from around 11.x.
2. For multi-node training, if every machine reads the training data from the same location (e.g. a mounted shared disk), specify `--share_folder true` so that only the global rank-0 card builds the cached data. Otherwise, by default card 0 on each machine builds its cached data independently.
3. If the dataset folder already contains the default cache folder `index-cache/`, any additionally specified `--data_cache` has no effect; the contents of the default cache folder are loaded preferentially during training.

### SFT

Model weights:

13B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b)

7B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b)

The training script requires 1 node, with 8 DCU-Z100L-32G cards per node.

The parallel configuration uses TP 8 and PP 1, with fp16 precision for fine-tuning. The configuration is as follows:

```
{
    "model_name_or_path": "facebook/llama-13b",
    "dataset_name_or_path": "./data",
    "output_dir": "./checkpoints/llama_sft_ckpts",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "per_device_eval_batch_size": 4,
    "eval_accumulation_steps":16,
    "num_train_epochs": 3,
    "learning_rate": 3e-05,
    "warmup_steps": 30,
    "logging_steps": 1,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 256,
    "max_length": 512,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
    "do_eval": true,
    "disable_tqdm": true,
    "load_best_model_at_end": true,
    "eval_with_do_generation": false,
    "metric_for_best_model": "accuracy",
    "recompute": true,
    "save_total_limit": 1,
    "tensor_parallel_degree": 8
}
```

SFT command:

```
cd ./llm
python3 -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" finetune_generation.py ./llama/sft_tp_argument.json
```

## Results

### Continued Pre-training Accuracy

Training data: [https://bj.bcebos.com/paddlenlp/models/transformers/llama/data](https://bj.bcebos.com/paddlenlp/models/transformers/llama/data)

GPGPUs used: 8× DCU-Z100L-32G.

Model accuracy (max_sequence_length: 2048):
| Cards | Distributed framework | Convergence |
| :------: | :------: |:------: |
| 8 | Paddle |  |
### SFT Accuracy

Training data: [https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz](https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz)

GPGPUs used: 8× DCU-Z100L-32G.

Model accuracy (max_sequence_length: 512):

| Cards | Distributed framework |                         Convergence                          |
| :--: | :--------: | :----------------------------------------------------------: |
|  8   |   Paddle   | train_loss fluctuates around 0.7 after 2 epochs; eval_loss: 1.03, eval_accuracy: 0.739, eval_ppl: 2.82 |

## Benchmark

### Training Benchmark

The dataset is [tatsu-lab/alpaca · Datasets at Hugging Face](https://huggingface.co/datasets/tatsu-lab/alpaca); place it under ./examples/benchmark/peft/paddle:

```
$tree tatsu-lab
tatsu-lab/
└── alpaca
    └── data
        └── train-00000-of-00001-a09b74b3ef9c3b56.parquet
```
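
The parquet file can be inspected with pandas before running the benchmark (a sketch; reading parquet requires `pyarrow` or `fastparquet`, and the column names are printed rather than assumed):

```
import pandas as pd

# Peek at the downloaded alpaca training split.
df = pd.read_parquet("tatsu-lab/alpaca/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet")
print(df.shape)
print(df.columns.tolist())   # alpaca records carry instruction / input / output style fields
print(df.iloc[0])
```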

Training benchmark command:

```
cd ./examples/benchmark/peft/paddle

RCCL_NCHANNELS=8 HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" benchmark.py --model_name_or_path facebook/llama-13b --english --train_data_size 1000  --intokens --intokens_length 1024  --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy no  --fp16 --fp16_opt_level O2 --recompute --tensor_parallel_degree 8 --logging_steps 50 --output_dir outputs
```

### Inference Benchmark 1

```
cd ./examples/benchmark/peft/paddle
python3  inference_benchmark.py   --model_name_or_path facebook/llama-13b --dtype float16 --do_forward --do_generate
```

### Inference Benchmark 2 (switching to [PaddleNLP-develop](https://github.com/PaddlePaddle/PaddleNLP/tree/28158b9735837495e6c73f848925e1d72b821863))

```
pip3 uninstall paddlenlp
cd ./llm
PYTHONPATH=../:$PYTHONPATH \
python3 predictor.py --model_name_or_path facebook/llama-13b --dtype float16 --src_length 300 --max_length 100 --output_file "infer.json" --batch_size 1 --benchmark
```

### LAMBADA Inference Evaluation

```
cd ./examples/benchmark/lambada
wget https://paddlenlp.bj.bcebos.com/data/benchmark/lambada_test.jsonl
```

To evaluate on the LAMBADA dataset, run the following script:

```
python3 eval.py \
--model_name_or_path facebook/llama-13b \
--batch_size 4 \
--eval_path lambada_test.jsonl \
--tensor_parallel_degree 1 \
--cloze_eval
```
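
The `--cloze_eval` flag treats LAMBADA as a last-word prediction task: a passage counts as correct only when the model's predicted final word matches the reference exactly. A toy sketch of that metric (illustrative only, not the logic inside eval.py):

```
def lambada_accuracy(predicted_last_words, reference_last_words):
    # Fraction of passages whose predicted last word exactly matches the reference.
    correct = sum(p.strip() == r.strip()
                  for p, r in zip(predicted_last_words, reference_last_words))
    return correct / len(reference_last_words)

print(lambada_accuracy(["dog", "house"], ["dog", "tree"]))  # 0.5
```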

## Application Scenarios

### Algorithm Category

`Natural Language Processing`

### Key Application Industries

`Healthcare, Education, Research, Finance`

## Source Repository and Feedback

- [https://developer.hpccube.com/codes/modelzoo/llama_paddle](https://developer.hpccube.com/codes/modelzoo/llama_paddle)

## References

* [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)