# Fine-tuning and Inference Deployment of Yuan 2.0 with FastChat

[FastChat](https://github.com/lm-sys/FastChat) is an open platform for training, deploying, and evaluating LLM (large language model) based chatbots. Built on Hugging Face [transformers](https://github.com/huggingface/transformers), it supports multi-node, multi-GPU fine-tuning of LLMs via DeepSpeed/FSDP. The following describes the workflow for fine-tuning the Yuan 2.0 model with FastChat.

## Preparing the Fine-tuning Environment

- docker pull nvcr.io/nvidia/pytorch:23.08-py3
- docker run -v HOST_WORK_PATH:/workspace/ --ipc=host --gpus all -p host-port:container-port --shm-size='64g' -it nvcr.io/nvidia/pytorch:23.08-py3 /bin/bash
- git clone https://github.com/lm-sys/FastChat.git
- cd FastChat
- pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
- pip install -e ".[model_worker,webui,train]"
- pip install deepspeed "bitsandbytes>=0.39.0" "transformers==4.31.0" plotly openai
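
Before moving on, a quick sanity check inside the container can confirm that the GPUs are visible and the key packages import cleanly. This snippet is just a convenience, not part of the FastChat workflow:

```python
# Optional sanity check: GPUs visible and key packages importable.
import torch
import transformers
import deepspeed

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
print("transformers:", transformers.__version__, "| deepspeed:", deepspeed.__version__)
```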

## Preparing the Model and Data
- Obtain the [Yuan 2.0](https://github.com/IEIT-Yuan/Yuan-2.0?tab=readme-ov-file#hugging-face%E7%89%88%E6%9C%AC-) Hugging Face model files.
- Prepare the data. FastChat is built for chatbot training and serving, so its standard datasets consist of single-turn and multi-turn conversations.<br />(1) For a custom dataset, follow the data format required by [fastchat](https://github.com/lm-sys/FastChat/blob/main/data/dummy_conversation.json), e.g. define a single-turn or multi-turn conversation dataset in a JSON file like the examples below.<br />(2) Convert an existing instruction dataset into single-turn conversations; for example, the alpaca-data [English](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release) or [Chinese](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/alpaca_data_zh_51k.json) dataset can be adapted to this format (a conversion sketch follows the examples below).
<br />(3) Use an open-source multi-turn conversation dataset, such as the user-assistant multi-turn dataset [belle-0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M) released by the BELLE project.

```
# multi-turn example
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
      },
      {
        "from": "human",
        "value": "Have a nice day!"
      },
      {
        "from": "gpt",
        "value": "You too!"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
      }
    ]
  }
]
# single-turn example
[
  {
    "id": "1",
    "conversations": [
      {
        "from": "human",
        "value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
      },
      {
        "from": "gpt",
        "value": "1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
      }
    ]
  },
  {
    "id": "2",
    "conversations": [
      {
        "from": "human",
        "value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Response:"
      },
      {
        "from": "gpt",
        "value": "The three primary colors are red, blue, and yellow."
      }
    ]
  }
]
```
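
As a reference for item (2) above, here is a minimal conversion sketch from alpaca-style instruction records (with `instruction`/`input`/`output` fields) to FastChat's single-turn conversation format. The file names `alpaca_data.json` and `alpaca_data_conversation.json` are placeholders; adjust paths and prompt wording to your data:

```python
# Minimal sketch: convert alpaca-style instruction data into FastChat's
# single-turn conversation format. File names below are placeholders.
import json

PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

with open("alpaca_data.json", encoding="utf-8") as f:
    records = json.load(f)

converted = []
for i, rec in enumerate(records):
    # Fold the optional "input" field into the instruction text.
    instruction = rec["instruction"]
    if rec.get("input"):
        instruction += "\n" + rec["input"]
    converted.append({
        "id": str(i),
        "conversations": [
            {"from": "human", "value": PROMPT.format(instruction=instruction)},
            {"from": "gpt", "value": rec["output"]},
        ],
    })

with open("alpaca_data_conversation.json", "w", encoding="utf-8") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)
```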
## Using FastChat's Customized Yuan 2.0 Training Scripts

```
In fastchat/train/train_mem.py:
- from fastchat.train.train import train
+ from fastchat.train.train_yuan2 import train

In fastchat/train/train_lora.py:
-from fastchat.train.train import (
-    DataArguments,
-    ModelArguments,
-    make_supervised_data_module,
-)

+from fastchat.train.train_yuan2 import (
+    DataArguments,
+    ModelArguments,
+    make_supervised_data_module,
+)

Copy the special-token handling from fastchat/train/train_yuan2.py into train_lora.py:
+tokenizer.add_tokens(
+    [
+        "<eod>",
+        "<sep>",
+        "<pad>",
+        "<mask>",
+        "<predict>",
+        "<FIM_SUFFIX>",
+        "<FIM_PREFIX>",
+        "<FIM_MIDDLE>",
+        "<commit_before>",
+        "<commit_msg>",
+        "<commit_after>",
+        "<jupyter_start>",
+        "<jupyter_text>",
+        "<jupyter_code>",
+        "<jupyter_output>",
+        "<empty_output>",
+    ],
+    special_tokens=True,
+)

```
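
After making these changes, it may be worth verifying that the special tokens are actually registered. A minimal sketch, where `path-to-huggingface-models` is a placeholder for your local Yuan 2.0 checkpoint:

```python
# Quick check that Yuan 2.0 special tokens map to known ids in the tokenizer.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    "path-to-huggingface-models",  # placeholder path
    add_eos_token=False,
    add_bos_token=False,
    eos_token="<eod>",
)
tokenizer.add_tokens(["<sep>", "<pad>", "<mask>"], special_tokens=True)

for tok in ["<eod>", "<sep>", "<pad>"]:
    # Each special token should resolve to a single, non-unknown id.
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))
```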
> The yuan2 template information added to fastchat is shown below for reference; none of it needs to be modified, though developers with special requirements can adjust the templates.
```
# yuan template information

fastchat/conversation.py contains the conversation template customized for Yuan 2.0 chat:

# Yuan2.0 chat template
# source: https://huggingface.co/IEITYuan/Yuan2-2B-Janus-hf/blob/main/tokenizer_config.json#L6
register_conv_template(
    Conversation(
        name="yuan2",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.YUAN2,
        sep="<sep>",
        sep2="\n",
        stop_token_ids=[
            77185,
        ],  # "<eod>"
        stop_str="<eod>",
    )
)

fastchat/model/model_adapter.py contains the functions used to load the Yuan 2.0 chat model and tokenizer:

class Yuan2Adapter(BaseModelAdapter):
    """The model adapter for Yuan2.0"""

    def match(self, model_path: str):
        return "yuan2" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        revision = from_pretrained_kwargs.get("revision", "main")
        # from_pretrained_kwargs["torch_dtype"] = torch.bfloat16
        tokenizer = LlamaTokenizer.from_pretrained(
            model_path,
            add_eos_token=False,
            add_bos_token=False,
            eos_token='<eod>',
            eod_token='<eod>',
            sep_token='<sep>',
            revision=revision,
        )
        tokenizer.add_tokens(
            ['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>', '<commit_before>',
             '<commit_msg>', '<commit_after>', '<jupyter_start>', '<jupyter_text>', '<jupyter_code>',
             '<jupyter_output>', '<empty_output>'], special_tokens=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            # device_map='auto',
            trust_remote_code=True,
            **from_pretrained_kwargs
        )
        return model, tokenizer

    def get_default_conv_template(self, model_path: str) -> Conversation:
        return get_conv_template("yuan2")

fastchat/model/model_yuan2.py contains the default generation settings for the Yuan 2.0 chat model.
```
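
To see what a prompt built from this template actually looks like, the registered template can be rendered directly. A minimal sketch, assuming fastchat is installed:

```python
# Render a Yuan 2.0 chat prompt from the template registered above.
from fastchat.conversation import get_conv_template

conv = get_conv_template("yuan2")
conv.append_message(conv.roles[0], "Give three tips for staying healthy.")
conv.append_message(conv.roles[1], None)  # empty slot for the model's reply
print(repr(conv.get_prompt()))  # turns are joined with the "<sep>" separator
```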
## Full Fine-tuning of Yuan 2.0
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train_mem.py \
        --model_name_or_path path-to-huggingface-models \
        --trust_remote_code True \
        --data_path ./data/alpaca_data_zh_conversion.json \
        --bf16 True \
        --output_dir ./test_yuan2b_full \
        --num_train_epochs 3 \
        --per_device_train_batch_size 4 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1200 \
        --save_total_limit 10 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --model_max_length 1024 \
        --gradient_checkpointing True \
        --lazy_preprocess True \
        --deepspeed playground/zero2_ds_woloading.json \
        --efficient_loss False \
        --split_example_loss True \
        --last_response_loss False
```
 
> <br />--model_max_length sets the maximum length of a single training sample.<br /><br />--efficient_loss, --split_example_loss, and --last_response_loss select among three ways of computing the loss on multi-turn conversations: (1) efficient_loss computes the loss on every assistant response; (2) last_response_loss computes the loss only on the assistant response of the final turn; (3) split_example_loss splits a multi-turn conversation into multiple samples and computes the loss on the final assistant response of each sample. Exactly one of the three must be True and the others False. A small illustration follows below.<br /><br />For the remaining arguments, refer to the [fastchat source](https://github.com/lm-sys/FastChat) and [transformers](https://github.com/huggingface/transformers).
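
To make the three modes concrete, here is a small illustration of which assistant turns would be supervised in a three-turn conversation. This is not the FastChat implementation (which works via label masking); it only shows the idea:

```python
# Illustration only: which assistant turns contribute to the loss under each mode.
turns = ["user_1", "assistant_1", "user_2", "assistant_2", "user_3", "assistant_3"]
assistant_turns = [t for t in turns if t.startswith("assistant")]

# efficient_loss: one sample, loss on every assistant turn.
efficient = [(turns, assistant_turns)]

# last_response_loss: one sample, loss only on the final assistant turn.
last_response = [(turns, [assistant_turns[-1]])]

# split_example_loss: one sample per conversation prefix, loss on the
# last assistant turn of each prefix.
split_example = [
    (turns[: 2 * (i + 1)], [assistant_turns[i]]) for i in range(len(assistant_turns))
]

for name, samples in [("efficient", efficient),
                      ("last_response", last_response),
                      ("split_example", split_example)]:
    print(name)
    for context, supervised in samples:
        print("  context:", context, "| loss on:", supervised)
```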
- Reference ZeRO-2 config file:
```
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  },
  "bf16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true
}
```
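
The benchmark table further below uses ds_zero3_lora for the 51B/102B models, but only a ZeRO-2 config is shown here. As a sketch only (this is an assumption, not a config shipped with FastChat; all keys are standard DeepSpeed ZeRO-3 options), a ZeRO-3 config along these lines could be used instead:

```
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true
}
```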

## Parameter-Efficient Fine-tuning with LoRA and QLoRA
<br />Full fine-tuning of a large model is expensive. Instead, we can use parameter-efficient fine-tuning, which adds a small number of extra parameters to the model and tunes only those, as in the [LoRA](https://arxiv.org/abs/2106.09685) and [QLoRA](https://arxiv.org/abs/2305.14314) schemes.
<br />LoRA is essentially a reparameterization method: it attaches low-rank side branches to weight matrices and fine-tunes only those branches. Adding branches to only some of the weight matrices reduces compute, and updating only the branch parameters reduces memory usage and communication volume during parallel training.
<br />QLoRA builds on LoRA by quantizing the model weights to 4 bits and quantizing the scale parameters a second time (double quantization) to save further memory. Note that QLoRA typically attaches more side branches than LoRA; it does not speed up computation and in fact incurs some efficiency loss.
<br />With fastchat, Yuan 2.0 can be fine-tuned with LoRA or QLoRA very conveniently:
- Use the train_lora.py script: torchrun --nproc_per_node=8 --master_port=XXXX fastchat/train/train_lora.py ...
- Use --lora_target_modules to choose which modules get LoRA branches: one or more of "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"; the default is "q_proj", "v_proj".
- Use --lora_r to set the rank of the LoRA matrices.
- Use --q_lora (True or False) to enable or disable QLoRA.<br />
- For multi-turn conversations, parameter-efficient fine-tuning computes the loss the same way as full fine-tuning; choose one of the three modes defined for Yuan 2.0 above.
A reference script for parameter-efficient fine-tuning:
```
CUDA_VISIBLE_DEVICES=0 python fastchat/train/train_lora.py \
        --model_name_or_path hf-to-yuan-path \
        --trust_remote_code True \
        --data_path ./data/alpaca-data-conversation.json \
        --bf16 True \
        --output_dir ./checkpoints_yuan2_2b_lora \
        --num_train_epochs 3 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1200 \
        --save_total_limit 10 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --model_max_length 512 \
        --gradient_checkpointing True \
        --lazy_preprocess True \
        --q_lora True \
        --efficient_loss False \
        --split_example_loss True \
        --last_response_loss False
```
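
After LoRA fine-tuning, the output directory contains adapter weights rather than a full model. To serve the result with the deployment commands below, one option is to merge the adapter back into the base model. A minimal sketch using peft's `merge_and_unload`; all paths are placeholders:

```python
# Merge LoRA adapter weights into the base Yuan 2.0 model for deployment.
# All paths below are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, LlamaTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "path-to-huggingface-models", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(base, "./checkpoints_yuan2_2b_lora")
merged = model.merge_and_unload()  # folds the low-rank updates into the base weights

tokenizer = LlamaTokenizer.from_pretrained("path-to-huggingface-models")
merged.save_pretrained("./yuan2_2b_lora_merged")
tokenizer.save_pretrained("./yuan2_2b_lora_merged")
```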



## Fine-tuning Benchmark Reference
| Fine-tuning scheme | Sequence length | Model | Precision: load/compute | GPUs | bs: micro/global | Memory (per GPU) | Time per epoch |
| ------------- | ---- | ------------ | --------- | ----- | ----- | ------- | ------ |
| ds_zero2_full | 2048 | Yuan-2 2B    | bf16/bf16 | 8*L20 | 1/128 | 16 GB   | 1.68 h |
| ds_zero3_lora | 2048 | Yuan-2 51B   | bf16/bf16 | 8*L20 | 1/128 | 43 GB   | 23 h   |
| ds_zero3_lora | 2048 | Yuan-2 102B  | bf16/bf16 | 8*L20 | 1/128 | 45 GB   | 47 h   |
| ds_zero2_full | 1024 | Yuan-2 2B    | bf16/bf16 | 8*L20 | 1/128 | 15 GB   | 1.3 h  |
| ds_zero3_lora | 1024 | Yuan-2 51B   | bf16/bf16 | 8*L20 | 1/128 | 43 GB   | 18 h   |
| ds_zero3_lora | 1024 | Yuan-2 102B  | bf16/bf16 | 8*L20 | 1/128 | 42 GB   | 40 h   |
| Qlora         | 1024 | Yuan-2 2B    | int4/bf16 | 1*L20 | 1/16  | 4.5 GB  | 3.4 h  |

> The tests above used the 52K alpaca samples converted into single-turn conversation data; time per epoch is the wall-clock time for one fine-tuning epoch.

## Deploying and Using the Fine-tuned Model

A chat model fine-tuned from Yuan 2.0 can be served and used very conveniently with fastchat.

- Command line
```
# Serve the chat model on N GPUs
python3 -m fastchat.serve.cli --model-path PATH-TO_CHATMODELS --num-gpus N
```
- Web GUI
```
python3 -m fastchat.serve.controller --host 0.0.0.0 &
# add --gpus 0,1,2,3 --num-gpus 4 to load the model across 4 GPUs for inference
python3 -m fastchat.serve.model_worker --model-path PATH-TO_CHATMODELS --host 0.0.0.0 &
python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port MAPPED-HOST-PORT
```
- OpenAI-Compatible RESTful APIs

To install `fastchat` and its dependencies, run:
```shell
pip3 install "fschat[model_worker,webui]"
pip3 install transformers==4.36.2 einops==0.7.0 gradio==3.50.2 gradio_client==0.6.1 pydantic==1.10.13
```

With `fastchat` installed correctly, refer to the [fastchat openai api launch script](../examples/fastchat_openai_server_engine.sh) and adjust HOST, PORT, MODEL_PATH, etc. in the script:
```shell
CONTROLLER_HOST="0.0.0.0"
CONTROLLER_PORT=8503

MODEL_WORKER_HOST="0.0.0.0"
MODEL_WORKER_PORT=8504

API_SERVER_HOST="0.0.0.0"
API_SERVER_PORT=8505

MODEL_PATH="/mnt/models/Yuan2-2B-Mars-hf/"
```
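
The launch script essentially brings up three services: the controller, a model worker, and the OpenAI-compatible API server. A minimal sketch of the corresponding commands, assuming the variables above (the flags in the shipped script may differ slightly):

```shell
# Start the controller, a model worker, and the OpenAI-compatible API server.
python3 -m fastchat.serve.controller \
    --host ${CONTROLLER_HOST} --port ${CONTROLLER_PORT} &

python3 -m fastchat.serve.model_worker \
    --model-path ${MODEL_PATH} \
    --host ${MODEL_WORKER_HOST} --port ${MODEL_WORKER_PORT} \
    --worker-address http://${MODEL_WORKER_HOST}:${MODEL_WORKER_PORT} \
    --controller-address http://${CONTROLLER_HOST}:${CONTROLLER_PORT} &

python3 -m fastchat.serve.openai_api_server \
    --host ${API_SERVER_HOST} --port ${API_SERVER_PORT} \
    --controller-address http://${CONTROLLER_HOST}:${CONTROLLER_PORT} &
```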
After startup, verify:
```shell
# Query http://<api_server_host>:<api_server_port>/v1/models with cURL (or a browser)
# and make sure the result contains a model entry similar to this:
curl http://<api_server_host>:<api_server_port>/v1/models
{
    "object": "list",
    "data": [
        {
            "id": "yuan2",
            "object": "model",
            "created": 1713955516,
            "owned_by": "fastchat",
            "root": "yuan2",
            "parent": null,
            "permission": [
                {
                    "id": "modelperm-KT7CstuH8yLHFWWiFzVpkd",
                    "object": "model_permission",
                    "created": 1713955516,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": true,
                    "allow_view": true,
                    "allow_fine_tuning": false,
                    "organization": "*",
                    "group": null,
                    "is_blocking": false
                }
            ]
        }
    ]
}
```
Call the service with the `openai` client:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<api_server_host>:<api_server_port>/v1",
)

completion = client.chat.completions.create(
    model="yuan2",
    messages=[
        # system: "You are a personal assistant who can help me solve many problems."
        {"role": "system", "content": "你是一个私人助手,能帮我解决很多问题。"},
        # user: "Hello!"
        {"role": "user", "content": "你好!"}
    ]
)

print(completion.choices[0].message)

# output (roughly: "Hello! Happy to help. Is there anything I can do for you?")
# ChatCompletionMessage(content='你好!很高兴为你提供帮助。请问有什么我可以为你做的吗?', role='assistant', function_call=None, tool_calls=None)
```
> The OpenAI-compatible RESTful APIs can also be used from [langchain](https://github.com/langchain-ai/langchain) to [build LLM-based applications](https://github.com/lm-sys/FastChat/blob/main/docs/langchain_integration.md).