# Examples

## Table of Contents

- [Examples](#examples)
  - [Table of Contents](#table-of-contents)
  - [Install requirements](#install-requirements)
  - [Supervised datasets collection](#supervised-datasets-collection)
    - [Conversation dataset generation](#conversation-dataset-generation)
  - [Stage1 - Supervised instructs tuning](#stage1---supervised-instructs-tuning)
    - [Arg List](#arg-list)
  - [Stage2 - Training reward model](#stage2---training-reward-model)
    - [Features and tricks in RM training](#features-and-tricks-in-rm-training)
    - [Experiment result](#experiment-result)
    - [Arg List](#arg-list-1)
  - [Stage3 - Training model using prompts with RL](#stage3---training-model-using-prompts-with-rl)
    - [Arg List](#arg-list-2)
  - [Inference example - After Stage3](#inference-example---after-stage3)
  - [Attention](#attention)
    - [data](#data)
  - [Supported Models](#supported-models)
    - [GPT](#gpt)
    - [BLOOM](#bloom)
    - [OPT](#opt)
    - [LLaMA](#llama)
  - [Add your own models](#add-your-own-models)
    - [Actor model](#actor-model)
    - [Reward model](#reward-model)
    - [Critic model](#critic-model)
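    - [LM model](#lm-model)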

---

## Install requirements

```shell
pip install -r requirements.txt
```

## Supervised datasets collection

We collected 104K bilingual (Chinese and English) datasets, which you can find in the [InstructionWild](https://github.com/XueFuzhao/InstructionWild) repo and in this [file](https://github.com/XueFuzhao/InstructionWild/blob/main/data/README.md).

Here is how we collected the data:

<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
</p>

### Conversation dataset generation

To further improve the model's ability to handle multi-turn conversations, we need to include multi-turn samples in the dataset. However, the InstructWild and Alpaca datasets currently consist of single-turn conversations only, and their organization is not suitable for storing multi-turn conversations. In addition to converting these datasets, we also need to include multi-turn conversation datasets such as ShareGPT and transform them into the training format supported by ColossalChat.

A sample of the conversation dataset should have the following fields:

- `type` (str, optional): The type of the data sample.
- `language` (str, optional): The language of the data sample.
- `dataset` (str, optional): The dataset the data sample originates from.
- `conversations` (list, compulsory): Conversation content of the data sample.
- `id` (int, optional): The ID of the data sample.

A simple example:

```json
{
  "type": "instruction",
  "language": "English",
  "dataset": "Alpaca",
  "conversations": [
    {
      "from": "human",
      "value": "Give three tips for staying healthy."
    },
    {
      "from": "gpt",
      "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    }
  ],
  "id": 1
}
```

> **NOTE:** Only the key `conversations` is compulsory for training; the other keys serve as metadata. The length of `conversations` varies.
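
As an illustration of the conversion described above, here is a minimal sketch that wraps a single-turn Alpaca-style record into the conversation format (a hypothetical helper for illustration only; see `examples/generate_conversation_dataset.py` for the actual conversion):

```python
def alpaca_to_conversation(sample: dict, idx: int) -> dict:
    # Hypothetical helper: wraps one single-turn Alpaca-style record
    # (instruction/input/output) into the conversation format shown above.
    prompt = sample["instruction"]
    if sample.get("input"):
        prompt += "\n" + sample["input"]
    return {
        "type": "instruction",
        "language": "English",
        "dataset": "Alpaca",
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": sample["output"]},
        ],
        "id": idx,
    }
```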

You can run `examples/generate_conversation_dataset.py` to generate a conversation dataset supported by ColossalChat.

You can use the following command to generate a conversation dataset.

```bash
python generate_conversation_dataset.py \
    --dataset "All" \
    --save_path "/path/to/dataset"
```

## Stage1 - Supervised instructs tuning

Stage1 is supervised instruction fine-tuning, which uses the datasets mentioned earlier to fine-tune the model.
[[Stage1 tutorial video]](https://www.youtube.com/watch?v=-qFBZFmOJfg)

You can run `examples/train_sft.sh` to start supervised instruction fine-tuning.

You can also use the following command to start supervised instruction fine-tuning with your own settings.

```bash
torchrun --standalone --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --save_path /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
    --accumulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1 \
    --grad_checkpoint
```
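
With the settings above, the effective global batch size per optimizer step is per-device batch size × accumulation steps × number of processes, i.e. 4 × 8 × 4 = 128 samples.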

**Note**: the supervised dataset follows the format below.

```json
[
    {
        "instruction": "Provide a list of the top 10 most popular mobile games in Asia",
        "input": "",
        "output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
        "id": 0
    },
    ...
]
```
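
For intuition, records in this format are typically rendered into a single training string. Here is a minimal sketch, assuming the common Alpaca-style prompt template (the exact template used by ColossalChat may differ; check `train_sft.py`):

```python
def render_sft_sample(sample: dict) -> str:
    # Assumed Alpaca-style template; an illustration, not necessarily the
    # exact template used by ColossalChat.
    if sample.get("input"):
        header = ("Below is an instruction that describes a task, paired with an input "
                  "that provides further context. Write a response that appropriately "
                  "completes the request.")
        body = f"### Instruction:\n{sample['instruction']}\n\n### Input:\n{sample['input']}"
    else:
        header = ("Below is an instruction that describes a task. Write a response "
                  "that appropriately completes the request.")
        body = f"### Instruction:\n{sample['instruction']}"
    return f"{header}\n\n{body}\n\n### Response:\n{sample['output']}"
```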

### Arg List

- `--strategy`: the strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2' (see the note after this list)
- `--model`: model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- `--pretrain`: pretrain model, type=str, default=None
- `--max_datasets_size`: the max size of dataset, type=int, default=None
- `--save_path`: path to save the model, type=str, default='output'
- `--need_optim_ckpt`: whether to save optim ckpt, type=bool, default=False
- `--max_epochs`: max epochs for training, type=int, default=3
- `--batch_size`: batch size while training, type=int, default=4
- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
- `--grad_checkpoint`: enable gradient checkpointing, type=bool, default=False
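
Roughly speaking, `ddp` is vanilla PyTorch DistributedDataParallel with a full model replica per GPU; `colossalai_zero2` shards optimizer states and gradients across ranks (ZeRO stage 2) to cut memory use; and `colossalai_gemini` additionally moves tensors between GPU and CPU memory with ColossalAI's Gemini heterogeneous memory manager, trading some speed for larger models.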

## Stage2 - Training reward model

In stage 2, we train a reward model. Different outputs for the same prompt are manually ranked, and these rankings supervise the reward model to assign corresponding scores.
[[Stage2 tutorial video]](https://www.youtube.com/watch?v=gMx2CApKhuo)

You can run `examples/train_rm.sh` to start reward model training.

You can also use the following command to start training a reward model.

```bash
torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --loss_fn 'log_exp' \
    --save_path 'rmstatic.pt'
```

### Features and tricks in RM training

- We support the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
- We support two kinds of loss functions, `log_sig` (used by OpenAI) and `log_exp` (used by Anthropic); see the sketch after this list.
- We monitor `valid_acc` and `pair_dist` instead of the loss to track progress during training.
- We add a special token to the end of the sequence to get better results.
- We use a cosine learning rate scheduler for RM training.
- We set `value_head` as one linear layer and initialize its weight using the N(0, 1/(d_model + 1)) distribution.
- We train a BLOOM-560m reward model for 1 epoch and find that its test accuracy matches the performance mentioned in [Anthropic's paper](https://arxiv.org/abs/2204.05862).
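
For reference, below is a minimal sketch of the two pairwise ranking losses (hypothetical helpers, not the exact implementation in `coati`). Given reward scores for a chosen and a rejected response, the two forms are mathematically equivalent, since `-log(sigmoid(d)) = log(1 + exp(-d))`, but the `log_sig` form is the numerically stabler way to compute it:

```python
import torch
import torch.nn.functional as F


def log_sig_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(r_chosen - r_rejected)), computed stably via logsigmoid.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


def log_exp_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # log(1 + exp(r_rejected - r_chosen)); same value as log_sig_loss,
    # but the exp can overflow when the score gap is large.
    return torch.log(1 + torch.exp(rejected_reward - chosen_reward)).mean()
```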

### Experiment result

Model performance in [Anthropic's paper](https://arxiv.org/abs/2204.05862):

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">

<div align=left>Our training & test results of BLOOM-560m for 1 epoch:

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225262950-a7f0a686-25de-44ec-98f2-11b83ea86674.png">

<div align=left>We also train a reward model based on LLaMA-7B, which reaches an accuracy of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.

### Arg List

- `--strategy`: the strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- `--model`: model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- `--pretrain`: pretrain model, type=str, default=None
- `--model_path`: the path of the rm model (if continuing to train), type=str, default=None
- `--save_path`: path to save the model, type=str, default='output'
- `--need_optim_ckpt`: whether to save optim ckpt, type=bool, default=False
- `--max_epochs`: max epochs for training, type=int, default=3
- `--dataset`: dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
- `--subset`: subset of the dataset, type=str, default=None
- `--batch_size`: batch size while training, type=int, default=4
- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
- `--loss_fn`: which kind of loss function, choices=['log_sig', 'log_exp']
- `--max_len`: max sentence length for generation, type=int, default=512

## Stage3 - Training model using prompts with RL

Stage3 uses a reinforcement learning algorithm, which is the most complex part of the training process, as shown below:

<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
</p>

You can run `examples/train_prompts.sh` to start PPO training.

You can also use the following command to start PPO training.
[[Stage3 tutorial video]](https://www.youtube.com/watch?v=Z8wwSHxPL9g)

```bash
torchrun --standalone --nproc_per_node=4 train_prompts.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --prompt_dataset /path/to/your/prompt_dataset \
    --pretrain_dataset /path/to/your/pretrain_dataset \
    --rm_pretrain /your/pretrain/rm/definition \
    --rm_path /your/rm/model/path
```

Prompt dataset: the instruction dataset mentioned in the figure above, which includes the instructions. For example, you can use this [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py), which samples `instinwild_en.json` or `instinwild_ch.json` in [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data), to generate the prompt dataset.
Pretrain dataset: the pretrain dataset, which includes the instructions and corresponding responses. For example, you can use the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) from stage 1 supervised instruction tuning.

**Note**: the required datasets follow the format below.

- `pretrain dataset`

  ```json
  [
      {
          "instruction": "Provide a list of the top 10 most popular mobile games in Asia",
          "input": "",
          "output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
          "id": 0
      },
      ...
  ]
  ```

- `prompt dataset`

  ```json
  [
      {
          "instruction": "Edit this paragraph to make it more concise: \"Yesterday, I went to the store and bought some things. Then, I came home and put them away. After that, I went for a walk and met some friends.\"",
          "id": 0
      },
      {
          "instruction": "Write a descriptive paragraph about a memorable vacation you went on",
          "id": 1
      },
      ...
  ]
  ```

### Arg List

- `--strategy`: the strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- `--model`: model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- `--pretrain`: pretrain model, type=str, default=None
- `--rm_model`: reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- `--rm_pretrain`: pretrain model for reward model, type=str, default=None
- `--rm_path`: the path of rm model, type=str, default=None
- `--save_path`: path to save the model, type=str, default='output'
- `--prompt_dataset`: path of the prompt dataset, type=str, default=None
- `--pretrain_dataset`: path of the ptx dataset, type=str, default=None
- `--need_optim_ckpt`: whether to save optim ckpt, type=bool, default=False
- `--num_episodes`: num of episodes for training, type=int, default=10
- `--num_update_steps`: number of steps to update policy per episode, type=int
- `--num_collect_steps`: number of steps to collect experience per episode, type=int
- `--train_batch_size`: batch size while training, type=int, default=8
- `--ptx_batch_size`: batch size to compute ptx loss, type=int, default=1
- `--experience_batch_size`: batch size to make experience, type=int, default=8
- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
- `--kl_coef`: KL coefficient used for computing the reward, type=float, default=0.1 (see the sketch after this list)
- `--ptx_coef`: ptx coefficient used for computing the policy loss, type=float, default=0.9
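
For intuition about `--kl_coef`, here is a minimal sketch of how the reward is typically assembled in PPO-based RLHF (a hypothetical helper; see the experience maker in `coati` for the exact computation): a per-token KL penalty keeps the actor close to the initial SFT model, and the reward model's score is credited to the final token.

```python
import torch


def compute_reward(rm_score: torch.Tensor,           # (batch,) reward model score per sequence
                   actor_log_probs: torch.Tensor,    # (batch, seq_len) log-probs under the actor
                   initial_log_probs: torch.Tensor,  # (batch, seq_len) log-probs under the frozen SFT model
                   kl_coef: float = 0.1) -> torch.Tensor:
    # Approximate per-token KL between the actor and the initial policy.
    kl = actor_log_probs - initial_log_probs
    reward = -kl_coef * kl
    # The sequence-level RM score is added at the last token.
    reward[:, -1] += rm_score
    return reward
```

`--ptx_coef`, by contrast, weights an auxiliary language-modeling (ptx) loss computed on the pretrain dataset and mixed into the policy loss, following InstructGPT, to keep the actor from drifting too far from its supervised behavior.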

## Inference example - After Stage3

We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).

## Attention

The examples are demos of the whole training process. You need to tune the hyper-parameters to reach good performance.

### data

- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)

## Supported Models

### GPT

- [x] GPT2-S (s)
- [x] GPT2-M (m)
- [x] GPT2-L (l)
- [x] GPT2-XL (xl)
- [x] GPT2-4B (4b)
- [ ] GPT2-6B (6b)

### BLOOM

- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
- [x] [BLOOM-7b](https://huggingface.co/bigscience/bloom-7b1)
- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)

### OPT

- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)

### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)

- [x] LLaMA-7B
- [x] LLaMA-13B
- [ ] LLaMA-33B
- [ ] LLaMA-65B

## Add your own models

If you want to support your own model in Coati, please refer to the pull request for RoBERTa support as an example, [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.

You should complete the implementation of four model classes: Reward model, Critic model, LM model, and Actor model.

Here is some example code for a new model named `Coati`. If the model is supported in Hugging Face [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`; otherwise, you can build the model yourself.

### Actor model

```python
from typing import Optional

from ..base import Actor
from transformers.models.coati import CoatiModel

class CoatiActor(Actor):
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        super().__init__(model, lora_rank, lora_train_bias)
```

### Reward model

```python
from typing import Optional

import torch.nn as nn

from ..base import RewardModel
from transformers.models.coati import CoatiModel

class CoatiRM(RewardModel):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```

### Critic model

```python
from typing import Optional

import torch.nn as nn

from ..base import Critic
from transformers.models.coati import CoatiModel

class CoatiCritic(Critic):
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```
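
### LM model

The examples above stop at the critic; for completeness, here is a minimal sketch of the LM model following the same pattern as the actor (assuming an `LM` base class analogous to `Actor`):

```python
from typing import Optional

from ..base import LM
from transformers.models.coati import CoatiModel

class CoatiLM(LM):
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        super().__init__(model, lora_rank, lora_train_bias)
```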