# Examples

## Table of Contents

- [Examples](#examples)
  - [Table of Contents](#table-of-contents)
  - [Install requirements](#install-requirements)
  - [Supervised datasets collection](#supervised-datasets-collection)
  - [Stage1 - Supervised instructs tuning](#stage1---supervised-instructs-tuning)
    - [Arg List](#arg-list)
  - [Stage2 - Training reward model](#stage2---training-reward-model)
    - [Features and tricks in RM training](#features-and-tricks-in-rm-training)
    - [Experiment result](#experiment-result)
    - [Arg List](#arg-list-1)
  - [Stage3 - Training model using prompts with RL](#stage3---training-model-using-prompts-with-rl)
    - [Arg List](#arg-list-2)
  - [Inference example - After Stage3](#inference-example---after-stage3)
  - [Attention](#attention)
      - [data](#data)
  - [Support Model](#support-model)
    - [GPT](#gpt)
    - [BLOOM](#bloom)
    - [OPT](#opt)
    - [LLaMA](#llama)
  - [Add your own models](#add-your-own-models)
    - [Actor model](#actor-model)
    - [Reward model](#reward-model)
    - [Critic model](#critic-model)


---
## Install requirements

```shell
pip install -r requirements.txt
```

## Supervised datasets collection

We collected a 104K bilingual dataset in Chinese and English; you can find it in the
[InstructionWild](https://github.com/XueFuzhao/InstructionWild) repo.

The following figure shows how we collected the data.
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
</p>

## Stage1 - Supervised instructs tuning

Stage 1 is supervised instruction fine-tuning, which uses the datasets mentioned earlier to fine-tune the model.

You can run `examples/train_sft.sh` to start supervised instruction fine-tuning.

You can also use the following command to start supervised instruction fine-tuning with your own settings.
```shell
torchrun --standalone --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
    --accumulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1 \
    --grad_checkpoint
```
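
As a rough guide to the command above, the number of samples that contribute to each optimizer step can be estimated as shown below. This is only a sketch: it assumes that `--batch_size` is the per-process batch size and that gradients are accumulated over `--accumulation_steps` micro-batches on every process, which this README does not state explicitly. `effective_batch_size` is a hypothetical helper, not part of the repo.

```python
def effective_batch_size(nproc_per_node: int, batch_size: int, accumulation_steps: int) -> int:
    """Samples per optimizer step, assuming `batch_size` is per process."""
    return nproc_per_node * batch_size * accumulation_steps


# Values taken from the example command above.
print(effective_batch_size(nproc_per_node=4, batch_size=4, accumulation_steps=8))  # 128
```

If you change the number of processes, you may want to adjust `--accumulation_steps` so that this product stays roughly the same.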
### Arg List
- --strategy:          the strategy used for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrained model, type=str, default=None
- --max_datasets_size: the max size of the dataset, type=int, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save the optimizer checkpoint, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --log_interval:      number of steps between logs, type=int, default=100
- --grad_checkpoint:   enable gradient checkpointing, type=bool, default=False

## Stage2 - Training reward model

In Stage 2 we train a reward model. Different outputs for the same prompt are ranked manually, and the scores derived from these rankings supervise the training of the reward model.

You can run `examples/train_rm.sh` to start reward model training.

You can also use the following command to start training a reward model.
```shell
torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --loss_fn 'log_exp' \
    --save_path 'rmstatic.pt'
```
### Features and tricks in RM training
- We support the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
- We support 2 kinds of loss function, named 'log_sig' (used by OpenAI) and 'log_exp' (used by Anthropic).
- We monitor valid_acc and pair_dist instead of the loss to track progress during training.
- We add a special token to the end of the sequence to get better results.
- We use a cosine-decay learning rate scheduler for RM training.
- We implement value_head as a single linear layer and initialize its weight from the N(0, 1/(d_model + 1)) distribution.
- We train a Bloom-560m reward model for 1 epoch and find that its test accuracy matches the performance reported in [Anthropic's paper](https://arxiv.org/abs/2204.05862).
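
The two loss names and the two monitoring metrics above can be illustrated with a small pure-Python sketch. It assumes the usual pairwise-ranking formulation, where each pair consists of a reward for the chosen response and a reward for the rejected one; the function bodies are illustrative and are not copied from the repo. Written this way, 'log_sig' and 'log_exp' are algebraically the same quantity, since -log(sigmoid(x)) = log(1 + exp(-x)).

```python
import math
from typing import Sequence


def log_sig_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(chosen - rejected)) over all pairs."""
    losses = [-math.log(1.0 / (1.0 + math.exp(-(c - r)))) for c, r in zip(chosen, rejected)]
    return sum(losses) / len(losses)


def log_exp_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of log(1 + exp(rejected - chosen)) over all pairs."""
    losses = [math.log(1.0 + math.exp(r - c)) for c, r in zip(chosen, rejected)]
    return sum(losses) / len(losses)


def valid_acc(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs in which the chosen response gets the higher reward."""
    return sum(c > r for c, r in zip(chosen, rejected)) / len(chosen)


def pair_dist(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean reward gap between chosen and rejected responses."""
    return sum(c - r for c, r in zip(chosen, rejected)) / len(chosen)


# Made-up reward scores for three (chosen, rejected) pairs.
chosen_scores = [1.5, 0.2, -0.3]
rejected_scores = [0.5, 0.4, -1.0]

print("log_sig  :", log_sig_loss(chosen_scores, rejected_scores))
print("log_exp  :", log_exp_loss(chosen_scores, rejected_scores))
print("valid_acc:", valid_acc(chosen_scores, rejected_scores))
print("pair_dist:", pair_dist(chosen_scores, rejected_scores))
```

An accuracy near 0.5 with a pair distance near 0 would suggest the reward model cannot yet tell the two responses apart, which is why these two numbers are easier to read than the raw loss.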

### Experiment result
Model performance in [Anthropic's paper](https://arxiv.org/abs/2204.05862):

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">

<div align=left>Our training and test results for BLOOM-560m after 1 epoch:

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225262950-a7f0a686-25de-44ec-98f2-11b83ea86674.png">

<div align=left>We also trained a reward model based on LLaMA-7B, which reaches an accuracy of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.

### Arg List
- --strategy:          the strategy used for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrained model, type=str, default=None
- --model_path:        the path of the RM model (to continue training), type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save the optimizer checkpoint, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --dataset:           dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
- --subset:            subset of the dataset, type=str, default=None
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --loss_fn:           which kind of loss function, choices=['log_sig', 'log_exp']
- --max_len:           max sentence length for generation, type=int, default=512
- --test:              whether to run in test mode only; if true, only a small dataset is used

## Stage3 - Training model using prompts with RL

Stage 3 uses a reinforcement learning algorithm and is the most complex part of the training process, as shown below:

<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
</p>

You can run `examples/train_prompts.sh` to start PPO training.
You can also use the following command to start PPO training.

```shell
torchrun --standalone --nproc_per_node=4 train_prompts.py \
         --pretrain "/path/to/LLaMa-7B/" \
         --model 'llama' \
         --strategy colossalai_zero2 \
         --prompt_dataset /path/to/your/prompt_dataset \
         --pretrain_dataset /path/to/your/pretrain_dataset \
         --rm_pretrain /your/pretrain/rm/definition \
         --rm_path /your/rm/model/path
```

Prompt dataset: the instruction dataset shown in the figure above, which contains the instructions. For example, you can use this [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py) to sample `instinwild_en.json` or `instinwild_ch.json` from [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data) and generate the prompt dataset.

Pretrain dataset: the pretraining dataset, which contains instructions and their corresponding responses. For example, you can use the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) from Stage 1 supervised instructs tuning.
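
To make the idea of "sampling a prompt dataset" concrete, here is a minimal sketch of what such a sampling step might look like. It is not the linked script. It assumes the source file is a JSON list of objects that each carry an `instruction` field, and the `sample_prompts` helper and the output layout (`id` plus `instruction`) are hypothetical.

```python
import json
import random
from typing import Any, Dict, List


def sample_prompts(records: List[Dict[str, Any]], sample_size: int, seed: int = 0) -> List[Dict[str, Any]]:
    """Randomly pick `sample_size` records and keep only their instruction text."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = rng.sample(records, sample_size)
    return [{"id": idx, "instruction": rec["instruction"]} for idx, rec in enumerate(picked)]


# Tiny in-memory stand-in for the contents of a file such as `instinwild_en.json`.
records = [
    {"instruction": "Summarize the plot of a novel.", "output": "..."},
    {"instruction": "Explain what a reward model is.", "output": "..."},
    {"instruction": "Write a haiku about autumn.", "output": "..."},
]

prompts = sample_prompts(records, sample_size=2)
print(json.dumps(prompts, indent=2, ensure_ascii=False))
```

With a real file you would load `records` with `json.load` and write `prompts` back out with `json.dump`, then pass the resulting path to `--prompt_dataset`.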

### Arg List
- --strategy:          the strategy used for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrained model, type=str, default=None
- --rm_model:          reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- --rm_pretrain:       pretrained model for the reward model, type=str, default=None
- --rm_path:           the path of the RM model, type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --prompt_dataset:    path of the prompt dataset, type=str, default=None
- --pretrain_dataset:  path of the ptx dataset, type=str, default=None
- --need_optim_ckpt:   whether to save the optimizer checkpoint, type=bool, default=False
- --num_episodes:      num of episodes for training, type=int, default=10
- --max_epochs:        max epochs for training in one episode, type=int, default=5
- --max_timesteps:     max episodes in one batch, type=int, default=10
- --update_timesteps:  timesteps to update, type=int, default=10
- --train_batch_size:  batch size while training, type=int, default=8
- --ptx_batch_size:    batch size to compute ptx loss, type=int, default=1
- --experience_batch_size: batch size to make experience, type=int, default=8
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --kl_coef:           KL coefficient used for computing the reward, type=float, default=0.1
- --ptx_coef:          ptx coefficient used for computing the policy loss, type=float, default=0.9
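
The sketch below shows one common way the coefficients and step counts in the list above could fit together. It is an illustration under stated assumptions, not the trainer's actual code. It assumes the reward-model score is penalized by an approximate KL term scaled by `kl_coef`, that the policy loss is a weighted mix of the PPO loss and the ptx (pretraining) loss, and that one update is triggered every `update_timesteps` collected timesteps. All function names are hypothetical.

```python
def penalized_reward(rm_score: float, approx_kl: float, kl_coef: float = 0.1) -> float:
    """Reward-model score minus a KL penalty (assumed form)."""
    return rm_score - kl_coef * approx_kl


def combined_policy_loss(ppo_loss: float, ptx_loss: float, ptx_coef: float = 0.9) -> float:
    """Weighted mix of the ptx loss and the PPO loss (assumed form)."""
    return ptx_coef * ptx_loss + (1.0 - ptx_coef) * ppo_loss


def count_updates(num_episodes: int = 10, max_timesteps: int = 10, update_timesteps: int = 10) -> int:
    """Count policy updates if one update runs every `update_timesteps` timesteps."""
    updates = 0
    steps = 0
    for _ in range(num_episodes):
        for _ in range(max_timesteps):
            steps += 1  # one batch of experience is collected per timestep
            if steps % update_timesteps == 0:
                updates += 1  # each update would train for `max_epochs` epochs
    return updates


print(penalized_reward(rm_score=1.0, approx_kl=0.5))
print(combined_policy_loss(ppo_loss=1.0, ptx_loss=2.0))
print(count_updates())
```

Under these assumptions, a larger `kl_coef` keeps the actor closer to the initial model, and a larger `ptx_coef` gives more weight to the pretraining loss relative to the PPO loss.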

## Inference example - After Stage3
We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).


## Attention
The examples are demos of the whole training process. You need to tune the hyperparameters to achieve good performance.

#### data
- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)

## Support Model

### GPT
- [x]  GPT2-S (s)
- [x]  GPT2-M (m)
- [x]  GPT2-L (l)
- [x]  GPT2-XL (xl)
- [x]  GPT2-4B (4b)
- [ ]  GPT2-6B (6b)

### BLOOM
- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
- [x] [BLOOM-7b1](https://huggingface.co/bigscience/bloom-7b1)
- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)

### OPT
- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)

### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- [x]  LLaMA-7B
- [x]  LLaMA-13B
- [ ]  LLaMA-33B
- [ ]  LLaMA-65B

## Add your own models

If you want to support your own model in Coati, please refer to the pull request for RoBERTa support as an example: [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.

You should complete the implementation of four model classes: Reward model, Critic model, LM model, and Actor model.

Here is some example code for a new model named `Coati`.
If it is supported in Hugging Face [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`; otherwise, you can build the model yourself.

### Actor model
```python
from typing import Optional

# `Coati` is a placeholder name: import your own model class here.
from transformers.models.coati import CoatiModel

from ..base import Actor


class CoatiActor(Actor):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            # Load pretrained weights through the transformers API.
            model = CoatiModel.from_pretrained(pretrained)
        else:
            # `build_model` is a placeholder: build your own model here
            # if it is not supported in transformers.
            model = build_model()

        super().__init__(model, lora_rank, lora_train_bias)
```

### Reward model
```python
from typing import Optional

import torch.nn as nn
# `Coati` is a placeholder name: import your own model class here.
from transformers.models.coati import CoatiModel

from ..base import RewardModel


class CoatiRM(RewardModel):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            # Load pretrained weights through the transformers API.
            model = CoatiModel.from_pretrained(pretrained)
        else:
            # `build_model` is a placeholder: build your own model here
            # if it is not supported in transformers.
            model = build_model()

        # A single linear layer maps the hidden state to a scalar score.
        # `n_embd` is the hidden size; the attribute name depends on your model's config.
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```

### Critic model

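```python
from typing import Optional

import torch.nn as nn
# `Coati` is a placeholder name: import your own model class here.
from transformers.models.coati import CoatiModel

from ..base import Critic


class CoatiCritic(Critic):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            # Load pretrained weights through the transformers API.
            model = CoatiModel.from_pretrained(pretrained)
        else:
            # `build_model` is a placeholder: build your own model here
            # if it is not supported in transformers.
            model = build_model()

        # A single linear layer maps the hidden state to a scalar value.
        # `n_embd` is the hidden size; the attribute name depends on your model's config.
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```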