# Examples

## Table of Contents

- [Examples](#examples)
  - [Table of Contents](#table-of-contents)
  - [Install requirements](#install-requirements)
  - [Supervised datasets collection](#supervised-datasets-collection)
  - [Stage1 - Supervised instruction tuning](#stage1---supervised-instruction-tuning)
    - [Arg List](#arg-list)
  - [Stage2 - Training reward model](#stage2---training-reward-model)
    - [Features and tricks in RM training](#features-and-tricks-in-rm-training)
    - [Experiment result](#experiment-result)
    - [Arg List](#arg-list-1)
  - [Stage3 - Training model using prompts with RL](#stage3---training-model-using-prompts-with-rl)
    - [Arg List](#arg-list-2)
  - [Inference example - After Stage3](#inference-example---after-stage3)
  - [Attention](#attention)
      - [data](#data)
  - [Supported Models](#supported-models)
    - [GPT](#gpt)
    - [BLOOM](#bloom)
    - [OPT](#opt)
    - [LLaMA](#llama)
  - [Add your own models](#add-your-own-models)
    - [Actor model](#actor-model)
    - [LM model](#lm-model)
    - [Reward model](#reward-model)
    - [Critic model](#critic-model)


---
## Install requirements

```shell
pip install -r requirements.txt
```

## Supervised datasets collection

We collected a 104K bilingual dataset in Chinese and English, and you can find it in the
[InstructionWild](https://github.com/XueFuzhao/InstructionWild) repo.

The following picture shows how we collected the data.
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
</p>

## Stage1 - Supervised instruction tuning

Stage1 is supervised instruction fine-tuning, which uses the datasets mentioned earlier to fine-tune the model.

You can run `examples/train_sft.sh` to start supervised instruction fine-tuning.

You can also use the following command to start supervised instruction fine-tuning with your own settings.
```shell
torchrun --standalone --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path  /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
    --accimulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1
```
### Arg List
- --strategy:          the strategy to use for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='naive'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --max_datasets_size: the max size of dataset, type=int, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --log_interval:      how many steps between logs, type=int, default=100
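
The `--accimulation_steps` flag in the command above enables gradient accumulation: gradients from several micro-batches are summed before a single optimizer step, multiplying the effective batch size without extra memory. Below is a minimal, self-contained sketch of the mechanism (toy model and data for illustration only, not the training script itself):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 1)                   # toy stand-in for the LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accumulation_steps = 8                          # mirrors --accimulation_steps 8

for step in range(32):
    x, y = torch.randn(4, 8), torch.randn(4, 1)  # micro-batch of size 4
    loss = F.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()       # scale so accumulated grads average out
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one update per 8 micro-batches
        optimizer.zero_grad()
```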

## Stage2 - Training reward model

In stage 2 we train a reward model. Different outputs for the same prompt are manually ranked, and these rankings supervise the reward model, which learns to assign corresponding scores.
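
For intuition, a preference record pairs one preferred and one rejected completion for the same prompt. The field names below follow the [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)-style datasets; the text itself is made up:

```python
sample = {
    "prompt": "Human: How do I bake bread? Assistant:",
    "chosen": " Mix flour, water, yeast and salt, knead, let it rise, then bake.",
    "rejected": " I have no idea.",
}
# The reward model is trained so that
# score(prompt + chosen) > score(prompt + rejected).
```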

You can run `examples/train_rm.sh` to start reward model training.

You can also use the following command to start training a reward model.
```shell
torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --loss_fn 'log_exp' \
    --save_path 'rmstatic.pt'
```
### Features and tricks in RM training
- We support the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
- We support two loss functions, 'log_sig' (used by OpenAI) and 'log_exp' (used by Anthropic); see the sketch after this list.
- We report valid_acc and pair_dist instead of the loss to monitor progress during training.
- We add a special token to the end of the sequence to get better results.
- We use a cosine learning-rate scheduler for RM training.
- We implement value_head as a single linear layer and initialize its weight from an N(0, 1/(d_model + 1)) distribution.
- We trained a BLOOM-560m reward model for 1 epoch and found that its test accuracy matches the performance reported in [Anthropic's paper](https://arxiv.org/abs/2204.05862).
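
A minimal sketch of the two pairwise ranking losses (the function names here are illustrative, not the repo's API). Both push the chosen reward above the rejected one; 'log_sig' is simply the numerically stabler form of the same objective:

```python
import torch
import torch.nn.functional as F

def log_sig_loss(chosen_reward: torch.Tensor, reject_reward: torch.Tensor) -> torch.Tensor:
    # OpenAI-style: -log(sigmoid(r_chosen - r_reject))
    return -F.logsigmoid(chosen_reward - reject_reward).mean()

def log_exp_loss(chosen_reward: torch.Tensor, reject_reward: torch.Tensor) -> torch.Tensor:
    # Anthropic-style: log(1 + exp(r_reject - r_chosen))
    return torch.log(1 + torch.exp(reject_reward - chosen_reward)).mean()

chosen, reject = torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9])
print(log_sig_loss(chosen, reject), log_exp_loss(chosen, reject))  # same value
```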

### Experiment result
Model performance in [Anthropic's paper](https://arxiv.org/abs/2204.05862):

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">

<div align=left>Our training & test results of bloom-560m for 1 epoch:

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225262950-a7f0a686-25de-44ec-98f2-11b83ea86674.png">

<div align=left>We also trained a reward model based on LLaMA-7B, which reaches an accuracy of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.
### Arg List
- --strategy:          the strategy to use for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='naive'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --model_path:        the path of the reward model (if continuing training), type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --dataset:           dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
- --subset:            subset of the dataset, type=str, default=None
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --loss_fn:           the loss function to use, choices=['log_sig', 'log_exp']
- --max_len:           max sentence length for generation, type=int, default=512
- --test:              whether to run in test mode; if true, only a small dataset is used

## Stage3 - Training model using prompts with RL

Stage3 uses a reinforcement learning algorithm (PPO), which is the most complex part of the training process, as shown below:

<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
</p>
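
To make the control flow concrete, here is a toy sketch of the stage-3 loop using the argument names from the Arg List below. The generation, scoring, and update functions are stubs for illustration, not the repo's API:

```python
import random

def generate(prompt: str) -> str:      # stub: actor rollout
    return prompt + " <response>"

def score(sequence: str) -> float:     # stub: reward model
    return random.random()

num_episodes, max_timesteps, update_timesteps, max_epochs = 10, 10, 10, 5
prompts = ["prompt-a", "prompt-b"]

for _ in range(num_episodes):
    buffer = []                        # experience buffer
    for t in range(1, max_timesteps + 1):
        seq = generate(random.choice(prompts))
        buffer.append((seq, score(seq)))
        if t % update_timesteps == 0:  # time to update the policy
            for _ in range(max_epochs):
                pass                   # ppo_update(actor, critic, buffer) goes here
            buffer.clear()
```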

You can run `examples/train_prompts.sh` to start PPO training.
You can also use the following command to start PPO training.

```shell
torchrun --standalone --nproc_per_node=4 train_prompts.py \
         --pretrain "/path/to/LLaMa-7B/" \
         --model 'llama' \
         --strategy colossalai_zero2 \
         --prompt_path /path/to/your/prompt_dataset \
         --pretrain_dataset /path/to/your/pretrain_dataset \
         --rm_pretrain /your/pretrain/rm/definition \
         --rm_path /your/rm/model/path
```
### Arg List
- --strategy:          the strategy to use for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='naive'
- --model:             model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --rm_model:          reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- --rm_pretrain:       pretrain model for reward model, type=str, default=None
- --rm_path:           the path of rm model, type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --prompt_path:       path of the prompt dataset, type=str, default=None
- --pretrain_dataset:  path of the ptx dataset, type=str, default=None
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --num_episodes:      number of episodes for training, type=int, default=10
- --max_epochs:        max epochs for training in one episode, type=int, default=5
- --max_timesteps:     max timesteps in one episode, type=int, default=10
- --update_timesteps:  number of timesteps between updates, type=int, default=10
- --train_batch_size:  batch size while training, type=int, default=8
- --ptx_batch_size:    batch size for computing the ptx loss, type=int, default=1
- --experience_batch_size: batch size for collecting experience, type=int, default=8
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --kl_coef:           KL coefficient used for computing the reward, type=float, default=0.1 (see the sketch after this list)
- --ptx_coef:          ptx coefficient used for computing the policy loss, type=float, default=0.9
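
A sketch of how `--kl_coef` and `--ptx_coef` typically enter the objective (illustrative names and one common formulation, not necessarily the exact code): the reward is penalized by the KL divergence from the frozen SFT policy, and the policy loss is blended with a pretraining (ptx) language-modeling loss.

```python
import torch

def shaped_reward(rm_reward: torch.Tensor,
                  log_probs: torch.Tensor,
                  ref_log_probs: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    # Penalize drift from the frozen reference (SFT) policy so the actor
    # cannot wander arbitrarily far while chasing reward.
    kl = (log_probs - ref_log_probs).sum(dim=-1)
    return rm_reward - kl_coef * kl

def mixed_policy_loss(ppo_loss: torch.Tensor,
                      ptx_loss: torch.Tensor,
                      ptx_coef: float = 0.9) -> torch.Tensor:
    # Blend the PPO objective with the pretraining (ptx) loss.
    return ptx_coef * ptx_loss + (1 - ptx_coef) * ppo_loss
```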

## Inference example - After Stage3
We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).
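As a general illustration of int8 loading (this uses the standard transformers + bitsandbytes path with a placeholder checkpoint path; the actually supported options live in the `inference/` directory linked above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path; requires `pip install bitsandbytes accelerate`.
path = "/path/to/Coati-7B"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, load_in_8bit=True, device_map="auto")

inputs = tokenizer("Instruction: say hello\nResponse:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```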


## Attention
The examples are demos of the whole training process. You need to tune the hyper-parameters to reach good performance.

#### data
- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)
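
The checked datasets above are hosted on the Hugging Face Hub and can be pulled with the `datasets` library, for example:

```python
from datasets import load_dataset

rm_static = load_dataset("Dahoas/rm-static")   # preference pairs
hh_rlhf = load_dataset("Anthropic/hh-rlhf")    # helpful/harmless dialogues
print(rm_static["train"][0].keys())
```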

## Supported Models

### GPT
- [x]  GPT2-S (s)
- [x]  GPT2-M (m)
- [x]  GPT2-L (l)
- [x]  GPT2-XL (xl)
- [x]  GPT2-4B (4b)
- [ ]  GPT2-6B (6b)

### BLOOM
- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
- [x] [BLOOM-7b](https://huggingface.co/bigscience/bloom-7b1)
- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)

### OPT
- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)

### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- [x]  LLaMA-7B
- [x]  LLaMA-13B
- [ ]  LLaMA-33B
- [ ]  LLaMA-65B

## Add your own models

If you want to support your own model in Coati, please refer to the pull request that added RoBERTa support as an example, [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.

You should complete the implementation of four model classes: the Actor model, the LM model, the Reward model, and the Critic model.

Here is some example code for a new model named `Coati`. If the model is supported in Hugging Face [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`; otherwise, you can build the model yourself.

### Actor model
```python
from typing import Optional

from ..base import Actor
from transformers.models.coati import CoatiModel

class CoatiActor(Actor):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        super().__init__(model, lora_rank, lora_train_bias)
```

### LM model

```python
from typing import Optional

from ..base import LM
from transformers.models.coati import CoatiModel

class CoatiLM(LM):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        super().__init__(model, lora_rank, lora_train_bias)

    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels, **kwargs)
```
### Reward model
```python
from typing import Optional

import torch.nn as nn

from ..base import RewardModel
from transformers.models.coati import CoatiModel

class CoatiRM(RewardModel):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```

### Critic model

```python
from typing import Optional

import torch.nn as nn

from ..base import Critic
from transformers.models.coati import CoatiModel

class CoatiCritic(Critic):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # build your own model if it is not supported in transformers

        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```