# Examples

## Table of Contents

- [Examples](#examples)
  - [Table of Contents](#table-of-contents)
  - [Install requirements](#install-requirements)
  - [Supervised datasets collection](#supervised-datasets-collection)
    - [Conversation dataset generation](#conversation-dataset-generation)
  - [Stage1 - Supervised instructs tuning](#stage1---supervised-instructs-tuning)
    - [Arg List](#arg-list)
  - [Stage2 - Training reward model](#stage2---training-reward-model)
    - [Features and tricks in RM training](#features-and-tricks-in-rm-training)
    - [Experiment result](#experiment-result)
    - [Arg List](#arg-list-1)
  - [Stage3 - Training model using prompts with RL](#stage3---training-model-using-prompts-with-rl)
    - [Arg List](#arg-list-2)
  - [Inference example - After Stage3](#inference-example---after-stage3)
  - [Attention](#attention)
      - [data](#data)
  - [Support Model](#support-model)
    - [GPT](#gpt)
    - [BLOOM](#bloom)
    - [OPT](#opt)
    - [LLaMA](#llama)
  - [Add your own models](#add-your-own-models)
    - [Actor model](#actor-model)
    - [Reward model](#reward-model)
    - [Critic model](#critic-model)


---
## Install requirements

```shell
pip install -r requirements.txt
```

## Supervised datasets collection

We collected a 104K bilingual dataset in Chinese and English; you can find it in the [InstructionWild](https://github.com/XueFuzhao/InstructionWild) repository.

The following figure shows how we collected the data.
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
</p>

### Conversation dataset generation

To further improve the model's ability to handle multi-turn conversations, we need to include multi-turn samples in the dataset. However, the InstructWild and Alpaca datasets currently contain only single-turn conversations, and their organization is not suitable for storing multi-turn dialogue. In addition to converting these datasets, we also need to include multi-turn conversation datasets such as ShareGPT and transform them into the training format supported by ColossalChat.

A sample of conversation dataset should have the following fields:

* `type` (str, optional): The type of the data sample.
* `language` (str, optional): The language of the data sample.
* `dataset` (str, optional): The dataset the data sample originates from.
* `conversations` (list, compulsory): Conversation content of the data sample.
* `id` (int, optional): The ID of the data sample.

A simple example:
```json
{
    "type": "instruction",
    "language": "English",
    "dataset": "Alpaca",
    "conversations": [
        {
            "from": "human",
            "value": "Give three tips for staying healthy."
        },
        {
            "from": "gpt",
            "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
        }
    ],
    "id": 1
}
```

> **NOTE:**  Only the key `conversations` is compulsory for training; the other keys serve as metadata. The length of `conversations` varies.

You can run `examples/generate_conversation_dataset.py` to generate a conversation dataset supported by ColossalChat.

Use the following command to generate the conversation dataset.
```shell
python generate_conversation_dataset.py \
    --dataset "All" \
    --save_path "/path/to/dataset"
```
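
For illustration, here is a minimal sketch of such a conversion for a single-turn Alpaca-style record (this is not the actual logic of `generate_conversation_dataset.py`; the `instruction`/`input`/`output` field names follow the standard Alpaca schema):

```python
import json

def alpaca_to_conversation(sample: dict, idx: int) -> dict:
    # Fold the optional "input" field into the human turn.
    prompt = sample["instruction"]
    if sample.get("input"):
        prompt += "\n" + sample["input"]
    return {
        "type": "instruction",
        "language": "English",
        "dataset": "Alpaca",
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": sample["output"]},
        ],
        "id": idx,
    }

# Convert a whole Alpaca-style JSON file (paths are placeholders).
with open("alpaca_data.json") as f:
    records = json.load(f)
converted = [alpaca_to_conversation(r, i) for i, r in enumerate(records, start=1)]
with open("conversation_dataset.json", "w") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)
```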

## Stage1 - Supervised instructs tuning

Stage1 is supervised instruction fine-tuning, which uses the datasets mentioned earlier to fine-tune the model.
[[Stage1 tutorial video]](https://www.youtube.com/watch?v=-qFBZFmOJfg)

You can run `examples/train_sft.sh` to start supervised instruction fine-tuning.

You can also use the following command to start fine-tuning with your own settings.
```shell
torchrun --standalone --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path  /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
    --accumulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1 \
    --grad_checkpoint
```
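
With the settings above, the effective global batch size is batch_size × accumulation_steps × nproc_per_node = 4 × 8 × 4 = 128.
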
### Arg List
- --strategy:          the strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          path of the pretrained model, type=str, default=None
- --max_datasets_size: the max size of the dataset, type=int, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --log_interval:      how many steps to log, type=int, default=100
- --grad_checkpoint:   enable gradient checkpointing, type=bool, default=False

## Stage2 - Training reward model

We train a reward model in stage 2: human annotators rank different outputs for the same prompt, and these rankings supervise the training of the reward model so that it learns to assign a score to each response.
[[Stage2 tutorial video]](https://www.youtube.com/watch?v=gMx2CApKhuo)

You can run `examples/train_rm.sh` to start reward model training.

You can also use the following command to start training a reward model.
```shell
torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --loss_fn 'log_exp' \
    --save_path 'rmstatic.pt'
```
### Features and tricks in RM training
- We support the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
- We support two loss functions, 'log_sig' (used by OpenAI) and 'log_exp' (used by Anthropic); see the sketch after this list.
- We monitor valid_acc and pair_dist instead of the raw loss to track progress during training.
- We add a special token to the end of each sequence to get better results.
- We use a cosine learning-rate scheduler for RM training.
- We set value_head as a single linear layer and initialize its weight from the N(0, 1/(d_model + 1)) distribution.
- We trained a BLOOM-560m reward model for 1 epoch and found that its test accuracy matches the performance reported in [Anthropic's paper](https://arxiv.org/abs/2204.05862).
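
The two losses can be sketched as follows (a minimal illustration of the assumed pairwise-ranking formulation, not the exact code in this repo; both push the chosen response's score above the rejected one's, and the two are mathematically equivalent):

```python
import torch
import torch.nn.functional as F

def log_sig_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # 'log_sig': -log(sigmoid(r_chosen - r_rejected)), InstructGPT-style
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def log_exp_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # 'log_exp': log(1 + exp(r_rejected - r_chosen)), Anthropic-style
    return torch.log(1 + torch.exp(rejected_reward - chosen_reward)).mean()
```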

### Experiment result
Model performance in [Anthropic's paper](https://arxiv.org/abs/2204.05862):

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">

<div align=left>Our training & test result of bloom-560m for 1 epoch:

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225262950-a7f0a686-25de-44ec-98f2-11b83ea86674.png">

<div align=left>We also trained a reward model based on LLaMA-7B, which reaches an accuracy of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.

### Arg List
- --strategy:          the strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          path of the pretrained model, type=str, default=None
- --model_path:        path of the reward model checkpoint (to continue training from), type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --dataset:           dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
- --subset:            subset of the dataset, type=str, default=None
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --loss_func:         loss function to use, choices=['log_sig', 'log_exp']
- --max_len:           maximum sequence length, type=int, default=512
- --test:              whether to run in test mode; if true, only a small subset of the dataset is used

## Stage3 - Training model using prompts with RL

Stage3 uses a reinforcement learning algorithm, which is the most complex part of the training process, as shown below:

<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
</p>

You can run `examples/train_prompts.sh` to start PPO training, or use the following command to start it with your own settings.
[[Stage3 tutorial video]](https://www.youtube.com/watch?v=Z8wwSHxPL9g)

```shell
torchrun --standalone --nproc_per_node=4 train_prompts.py \
         --pretrain "/path/to/LLaMa-7B/" \
         --model 'llama' \
         --strategy colossalai_zero2 \
         --prompt_dataset /path/to/your/prompt_dataset \
         --pretrain_dataset /path/to/your/pretrain_dataset \
         --rm_pretrain /your/pretrain/rm/definition \
         --rm_path /your/rm/model/path
```

Prompt dataset: the instruction dataset mentioned in the figure above, which contains the instructions. For example, you can use this [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py), which samples `instinwild_en.json` or `instinwild_ch.json` from [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data), to generate the prompt dataset.
Pretrain dataset: the pretraining dataset containing the instructions and corresponding responses. For example, you can use the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) from stage 1 supervised instruction tuning.

### Arg List
- --strategy:          the strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          path of the pretrained model, type=str, default=None
- --rm_model:          reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- --rm_pretrain:       pretrained model for the reward model, type=str, default=None
- --rm_path:           path of the reward model checkpoint, type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --prompt_dataset:    path of the prompt dataset, type=str, default=None
- --pretrain_dataset:  path of the ptx dataset, type=str, default=None
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --num_episodes:      num of episodes for training, type=int, default=10
- --num_update_steps:  number of steps to update policy per episode, type=int
- --num_collect_steps: number of steps to collect experience per episode, type=int
- --train_batch_size:  batch size while training, type=int, default=8
- --ptx_batch_size:    batch size to compute ptx loss, type=int, default=1
- --experience_batch_size: batch size used to generate experience, type=int, default=8
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --kl_coef:           coefficient of the KL penalty when computing the reward, type=float, default=0.1
- --ptx_coef:          coefficient of the pretraining (ptx) loss when computing the policy loss, type=float, default=0.9 (see the sketch below)
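
A minimal sketch of how these two coefficients enter training (an assumed formulation for illustration, not the exact code in this repo):

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  log_probs: torch.Tensor,
                  ref_log_probs: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    # Penalize divergence of the actor from the reference (SFT) policy:
    # reward = r_RM - kl_coef * (log pi_actor - log pi_ref), a per-token KL approximation.
    return rm_score - kl_coef * (log_probs - ref_log_probs)

def policy_loss(actor_loss: torch.Tensor,
                ptx_loss: torch.Tensor,
                ptx_coef: float = 0.9) -> torch.Tensor:
    # Mix the PPO actor loss with a language-modeling loss on the pretrain dataset
    # (the "ptx" loss from InstructGPT) to reduce forgetting.
    return ptx_coef * ptx_loss + (1 - ptx_coef) * actor_loss
```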

## Inference example - After Stage3
We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).
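
As a minimal illustration, generation with int8 quantization might look like the sketch below (assuming the trained actor was saved in Hugging Face format; the checkpoint path is a placeholder, and `load_in_8bit` requires the `bitsandbytes` package):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/Coati-7B")
model = AutoModelForCausalLM.from_pretrained("/path/to/Coati-7B",
                                             load_in_8bit=True,  # int8 weight quantization
                                             device_map="auto")

inputs = tokenizer("Give three tips for staying healthy.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```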


## Attention
The examples are demos of the whole training process. You need to tune the hyper-parameters to reach good performance.

#### data
- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)

## Support Model

### GPT
- [x]  GPT2-S (s)
- [x]  GPT2-M (m)
- [x]  GPT2-L (l)
- [x]  GPT2-XL (xl)
- [x]  GPT2-4B (4b)
- [ ]  GPT2-6B (6b)

### BLOOM
- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
- [x] [BLOOM-7b](https://huggingface.co/bigscience/bloom-7b1)
- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)

### OPT
- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)

### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- [x]  LLaMA-7B
- [x]  LLaMA-13B
- [ ]  LLaMA-33B
- [ ]  LLaMA-65B

## Add your own models

If you want to support your own model in Coati, please refer to the pull request that added RoBERTa support as an example: [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.

You should complete the implementation of four model classes: the Reward model, the Critic model, the LM model, and the Actor model.

Here is some example code for a new model named `Coati`. If the model is supported in Hugging Face [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`; otherwise, you can build the model yourself.

### Actor model
```python
from typing import Optional

from ..base import Actor
from transformers.models.coati import CoatiModel

class CoatiActor(Actor):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers
        super().__init__(model, lora_rank, lora_train_bias)
```

### Reward model
```python
from typing import Optional

import torch.nn as nn

from ..base import RewardModel
from transformers.models.coati import CoatiModel

class CoatiRM(RewardModel):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```

### Critic model

```python
from typing import Optional

import torch.nn as nn

from ..base import Critic
from transformers.models.coati import CoatiModel

class CoatiCritic(Critic):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```
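
Once implemented, the classes can be instantiated like the built-in models (a hypothetical usage sketch; the checkpoint path is a placeholder):

```python
actor = CoatiActor(pretrained='/path/to/coati')
reward_model = CoatiRM(pretrained='/path/to/coati')
critic = CoatiCritic(pretrained='/path/to/coati')
```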