The [dataset_info.json](dataset_info.json) contains all available datasets. If you are using a custom dataset, please **make sure** to add a *dataset description* in `dataset_info.json` and specify `dataset: dataset_name` before training to use it.

Currently we support datasets in **alpaca** and **sharegpt** format.

```json
"dataset_name": {
  "hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url and file_name)",
  "ms_hub_url": "the name of the dataset repository on the ModelScope hub. (if specified, ignore script_url and file_name)",
  "script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name)",
  "file_name": "the name of the dataset folder or dataset file in this directory. (required if above are not specified)",
  "formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
  "ranking": "whether the dataset is a preference dataset or not. (default: False)",
  "subset": "the name of the subset. (optional, default: None)",
  "split": "the name of dataset split to be used. (optional, default: train)",
  "folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
  "num_samples": "the number of samples in the dataset to be used. (optional, default: None)",
  "columns (optional)": {
    "prompt": "the column name in the dataset containing the prompts. (default: instruction)",
    "query": "the column name in the dataset containing the queries. (default: input)",
    "response": "the column name in the dataset containing the responses. (default: output)",
    "history": "the column name in the dataset containing the histories. (default: None)",
    "messages": "the column name in the dataset containing the messages. (default: conversations)",
    "system": "the column name in the dataset containing the system prompts. (default: None)",
    "tools": "the column name in the dataset containing the tool description. (default: None)",
    "images": "the column name in the dataset containing the image inputs. (default: None)",
    "chosen": "the column name in the dataset containing the chosen answers. (default: None)",
    "rejected": "the column name in the dataset containing the rejected answers. (default: None)",
    "kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
  },
  "tags (optional, used for the sharegpt format)": {
    "role_tag": "the key in the message that represents the identity. (default: from)",
    "content_tag": "the key in the message that represents the content. (default: value)",
    "user_tag": "the value of the role_tag that represents the user. (default: human)",
    "assistant_tag": "the value of the role_tag that represents the assistant. (default: gpt)",
    "observation_tag": "the value of the role_tag that represents the tool results. (default: observation)",
    "function_tag": "the value of the role_tag that represents the function call. (default: function_call)",
    "system_tag": "the value of the role_tag that represents the system prompt. (default: system, can override system column)"
  }
}
```
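
As a rough illustration of the rule that `file_name` is required unless a hub or script URL is given, the following Python sketch adds one entry to an in-memory copy of the registry. The helper `add_dataset_description` and the names `my_dataset` and `my_data.json` are hypothetical placeholders, not part of the project.

```python
import json


def add_dataset_description(info, name, description):
    """Return a copy of the dataset_info mapping with one entry added."""
    sources = ("hf_hub_url", "ms_hub_url", "script_url", "file_name")
    # At least one source field must say where the data comes from.
    if not any(key in description for key in sources):
        raise ValueError("file_name is required when no hub or script URL is given")
    updated = dict(info)
    updated[name] = description
    return updated


# Hypothetical custom dataset stored as a local JSON file.
info = add_dataset_description(
    {},
    "my_dataset",
    {
        "file_name": "my_data.json",
        "formatting": "alpaca",
        "columns": {"prompt": "instruction", "query": "input", "response": "output"},
    },
)
print(json.dumps(info, indent=2))
```

The printed object is what would be merged into `dataset_info.json`; the training configuration would then refer to it with `dataset: my_dataset`.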

## Alpaca Format

### Supervised Fine-Tuning Dataset

- [Example dataset](alpaca_en_demo.json)

In supervised fine-tuning, the `instruction` column is concatenated with the `input` column to form the human prompt, so the human prompt is `instruction\ninput`. The `output` column holds the model response.

The `system` column will be used as the system prompt if specified.

The `history` column is a list consisting of string tuples representing prompt-response pairs in the history messages. Note that the responses in the history **will also be learned by the model** in supervised fine-tuning.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)",
    "history": [
      ["human instruction in the first round (optional)", "model response in the first round (optional)"],
      ["human instruction in the second round (optional)", "model response in the second round (optional)"]
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history"
  }
}
```
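
To make the column semantics concrete, here is a minimal Python sketch of how one alpaca-format record could be flattened into an ordered list of turns. It is an illustration only, not the project's actual loader: the function name `alpaca_to_messages` and the role labels `system`, `human`, and `model` are assumptions, and skipping an empty `input` is a guess at sensible behavior.

```python
def alpaca_to_messages(record):
    """Flatten one alpaca-format record into a list of (role, text) pairs."""
    messages = []
    if record.get("system"):
        messages.append(("system", record["system"]))
    # Earlier rounds come first; each history item is a [prompt, response] pair.
    for old_prompt, old_response in record.get("history") or []:
        messages.append(("human", old_prompt))
        messages.append(("model", old_response))
    # Join instruction and input with a newline; skip input when it is empty.
    parts = [record["instruction"]]
    if record.get("input"):
        parts.append(record["input"])
    messages.append(("human", "\n".join(parts)))
    messages.append(("model", record["output"]))
    return messages


example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
    "history": [["Hi", "Hello!"]],
}
messages = alpaca_to_messages(example)
for role, text in messages:
    print(f"{role}: {text!r}")
```

Every `model` turn in the result, including the one taken from `history`, corresponds to a response that is learned during supervised fine-tuning.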

### Pre-training Dataset

- [Example dataset](c4_demo.json)

In pre-training, only the `text` column will be used for model learning.

```json
[
  {"text": "document"},
  {"text": "document"}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "text"
  }
}
```

### Preference Dataset

Preference datasets are used for reward modeling, DPO training and ORPO training.

Each sample requires a better response in the `chosen` column and a worse response in the `rejected` column.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "chosen": "chosen answer (required)",
    "rejected": "rejected answer (required)"
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```
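
Before training, it can help to confirm that every sample really carries both answers. The sketch below is one possible check, assuming the default column names shown above; the helper `check_preference_record` is hypothetical, and flagging identical `chosen` and `rejected` answers is an extra assumption rather than a documented requirement.

```python
def check_preference_record(record):
    """Return a list of problems found in one alpaca-format preference record."""
    problems = []
    for column in ("instruction", "chosen", "rejected"):
        value = record.get(column)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{column}'")
    # Assumption: a pair with identical answers carries no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("'chosen' and 'rejected' are identical")
    return problems


good = {"instruction": "Say hi", "chosen": "Hello!", "rejected": "Go away."}
bad = {"instruction": "Say hi", "chosen": "Hello!"}
print(check_preference_record(good))
print(check_preference_record(bad))
```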

### KTO Dataset

- [Example dataset](kto_en_demo.json)

KTO datasets require an extra `kto_tag` column containing the boolean human feedback.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "kto_tag": "human feedback [true/false] (required)"
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "kto_tag": "kto_tag"
  }
}
```
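
Because the feedback is boolean, a common slip is to store it as the string `"true"` instead of the JSON value `true`. The following sketch flags such records; the helper `kto_tag_problems` is hypothetical, and treating non-boolean tags as errors is an assumption based on the description above rather than a statement about how the loader behaves.

```python
def kto_tag_problems(records):
    """Yield (index, message) for records whose kto_tag is not a boolean."""
    for index, record in enumerate(records):
        tag = record.get("kto_tag")
        if not isinstance(tag, bool):
            yield index, f"kto_tag should be true/false, got {tag!r}"


records = [
    {"instruction": "q1", "output": "a1", "kto_tag": True},
    {"instruction": "q2", "output": "a2", "kto_tag": "true"},
]
problems = list(kto_tag_problems(records))
print(problems)
```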

### Multimodal Dataset

- [Example dataset](mllm_demo.json)

Multimodal datasets require an `images` column containing the paths to the input images. Currently, only one image is supported.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "images": [
      "image path (required)"
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "images": "images"
  }
}
```

## Sharegpt Format

### Supervised Fine-Tuning Dataset

- [Example dataset](glaive_toolcall_en_demo.json)

Compared to the alpaca format, the sharegpt format allows datasets to have **more roles**, such as human, gpt, observation and function. They are presented as a list of objects in the `conversations` column.

Note that human and observation messages should appear in odd positions, while gpt and function messages should appear in even positions.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "function_call",
        "value": "tool arguments"
      },
      {
        "from": "observation",
        "value": "tool result"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)",
    "tools": "tool description (optional)"
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
  }
}
```
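
The odd/even position rule for roles can be checked mechanically. The sketch below assumes the default tag values (`human`, `observation`, `gpt`, `function_call`) and the default `from` key; the helper name `misplaced_turns` is hypothetical.

```python
ODD_ROLES = {"human", "observation"}
EVEN_ROLES = {"gpt", "function_call"}


def misplaced_turns(conversations):
    """Return 1-based positions whose role does not match the odd/even rule."""
    bad = []
    for position, message in enumerate(conversations, start=1):
        allowed = ODD_ROLES if position % 2 == 1 else EVEN_ROLES
        if message["from"] not in allowed:
            bad.append(position)
    return bad


valid = [
    {"from": "human", "value": "human instruction"},
    {"from": "function_call", "value": "tool arguments"},
    {"from": "observation", "value": "tool result"},
    {"from": "gpt", "value": "model response"},
]
invalid = [
    {"from": "human", "value": "first question"},
    {"from": "human", "value": "second question"},
    {"from": "gpt", "value": "model response"},
]
print(misplaced_turns(valid))
print(misplaced_turns(invalid))
```

A dataset that uses custom `tags` would need the two role sets adjusted to match.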

### Preference Dataset

- [Example dataset](dpo_en_demo.json)

Preference datasets in sharegpt format also require a better message in the `chosen` column and a worse message in the `rejected` column.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      {
        "from": "human",
        "value": "human instruction"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "chosen answer (required)"
    },
    "rejected": {
      "from": "gpt",
      "value": "rejected answer (required)"
    }
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "ranking": true,
  "columns": {
    "messages": "conversations",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

### OpenAI Format

The openai format is simply a special case of the sharegpt format, where the first message may be a system prompt.

```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": "system prompt (optional)"
      },
      {
        "role": "user",
        "content": "human instruction"
      },
      {
        "role": "assistant",
        "content": "model response"
      }
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
}
```
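
The `tags` section makes a conversion unnecessary, but writing the mapping out shows what it expresses. The following sketch rewrites one OpenAI-style record into the default sharegpt layout; `ROLE_MAP` and `openai_to_sharegpt` are hypothetical names, and moving the system message into a separate `system` column is an assumption made for illustration.

```python
# OpenAI-style role values mapped to the default sharegpt tag values.
ROLE_MAP = {"user": "human", "assistant": "gpt"}


def openai_to_sharegpt(record):
    """Rewrite one OpenAI-style record using the default sharegpt keys."""
    conversations = []
    system = ""
    for message in record["messages"]:
        if message["role"] == "system":
            # If several system messages appear, the last one is kept.
            system = message["content"]
        else:
            # Roles outside ROLE_MAP raise KeyError instead of being guessed.
            conversations.append(
                {"from": ROLE_MAP[message["role"]], "value": message["content"]}
            )
    result = {"conversations": conversations}
    if system:
        result["system"] = system
    return result


sample = {
    "messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
}
converted = openai_to_sharegpt(sample)
print(converted)
```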

The KTO datasets and multimodal datasets in sharegpt format are similar to the alpaca format.

Pre-training datasets are **incompatible** with the sharegpt format.