The [dataset_info.json](dataset_info.json) file contains all available datasets. If you are using a custom dataset, please **make sure** to add a *dataset description* to `dataset_info.json` and specify `dataset: dataset_name` before training to use it.

The `dataset_info.json` file should be placed in the `dataset_dir` directory. You can change `dataset_dir` to use another directory. The default value is `./data`.
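
For example, to load a custom dataset from another directory, the relevant training arguments might look like the sketch below (shown as JSON to match this document's examples; LLaMA-Factory configs are commonly written in YAML, and `./my_data` is a hypothetical directory containing your `dataset_info.json`):

```json
{
  "dataset": "dataset_name",
  "dataset_dir": "./my_data"
}
```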

Currently we support datasets in **alpaca** and **sharegpt** format.

```json
"dataset_name": {
  "hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url, file_name and cloud_file_name)",
  "ms_hub_url": "the name of the dataset repository on the Model Scope hub. (if specified, ignore script_url, file_name and cloud_file_name)",
  "script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name and cloud_file_name)",
  "cloud_file_name": "the name of the dataset file in s3/gcs cloud storage. (if specified, ignore file_name)",
  "file_name": "the name of the dataset folder or dataset file in this directory. (required if above are not specified)",
  "formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
  "ranking": "whether the dataset is a preference dataset or not. (default: False)",
  "subset": "the name of the subset. (optional, default: None)",
  "split": "the name of dataset split to be used. (optional, default: train)",
  "folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
  "num_samples": "the number of samples in the dataset to be used. (optional, default: None)",
  "columns (optional)": {
    "prompt": "the column name in the dataset containing the prompts. (default: instruction)",
    "query": "the column name in the dataset containing the queries. (default: input)",
    "response": "the column name in the dataset containing the responses. (default: output)",
    "history": "the column name in the dataset containing the histories. (default: None)",
    "messages": "the column name in the dataset containing the messages. (default: conversations)",
    "system": "the column name in the dataset containing the system prompts. (default: None)",
    "tools": "the column name in the dataset containing the tool description. (default: None)",
    "images": "the column name in the dataset containing the image inputs. (default: None)",
    "videos": "the column name in the dataset containing the videos inputs. (default: None)",
    "audios": "the column name in the dataset containing the audios inputs. (default: None)",
    "chosen": "the column name in the dataset containing the chosen answers. (default: None)",
    "rejected": "the column name in the dataset containing the rejected answers. (default: None)",
    "kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
  },
  "tags (optional, used for the sharegpt format)": {
    "role_tag": "the key in the message represents the identity. (default: from)",
    "content_tag": "the key in the message represents the content. (default: value)",
    "user_tag": "the value of the role_tag represents the user. (default: human)",
    "assistant_tag": "the value of the role_tag represents the assistant. (default: gpt)",
    "observation_tag": "the value of the role_tag represents the tool results. (default: observation)",
    "function_tag": "the value of the role_tag represents the function call. (default: function_call)",
    "system_tag": "the value of the role_tag represents the system prompt. (default: system, can override system column)"
  }
}
```
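
For instance, a hypothetical entry for an alpaca-format dataset hosted on the Hugging Face hub might look like this (the repository name, subset and split below are placeholders):

```json
"example_dataset": {
  "hf_hub_url": "organization/dataset_repo",
  "subset": "default",
  "split": "train",
  "formatting": "alpaca"
}
```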

## Alpaca Format

### Supervised Fine-Tuning Dataset

- [Example dataset](alpaca_en_demo.json)

In supervised fine-tuning, the `instruction` column will be concatenated with the `input` column and used as the user prompt, i.e. the user prompt will be `instruction\ninput`. The `output` column represents the model response.

For reasoning models, if the dataset contains chain-of-thought (CoT), the CoT needs to be placed in the model response, e.g. `<think>cot</think>output`.
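
For instance, a single record with CoT might look like the following sketch (the content is invented for illustration):

```json
{
  "instruction": "What is 12 multiplied by 7?",
  "input": "",
  "output": "<think>12 * 7 = 84.</think>The answer is 84."
}
```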

The `system` column will be used as the system prompt if specified.

The `history` column is a list consisting of string tuples representing prompt-response pairs in the history messages. Note that the responses in the history **will also be learned by the model** in supervised fine-tuning.

```json
[
  {
    "instruction": "user instruction (required)",
    "input": "user input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)",
    "history": [
      ["user instruction in the first round (optional)", "model response in the first round (optional)"],
      ["user instruction in the second round (optional)", "model response in the second round (optional)"]
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history"
  }
}
```

> [!TIP]
> If the model has reasoning capabilities but the dataset does not contain chain-of-thought (CoT), LLaMA-Factory will automatically add an empty CoT to the data. When `enable_thinking` is `True` (slow thinking), the empty CoT is added to the model response and included in the loss computation; otherwise (fast thinking), it is added to the user prompt and excluded from the loss computation. Please keep the `enable_thinking` parameter consistent during training and inference.
>
> If you want to train samples containing CoT in slow-thinking mode and samples without CoT in fast-thinking mode, you can set `enable_thinking` to `None`. However, this feature is relatively complicated and should be used with caution.

### Pre-training Dataset

- [Example dataset](c4_demo.jsonl)

In pre-training, only the `text` column will be used for model learning.

```json
[
  {"text": "document"},
  {"text": "document"}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "text"
  }
}
```

### Preference Dataset

Preference datasets are used for reward modeling, DPO training, ORPO training, and SimPO training.

It requires a better response in the `chosen` column and a worse response in the `rejected` column.

```json
[
  {
    "instruction": "user instruction (required)",
    "input": "user input (optional)",
    "chosen": "chosen answer (required)",
    "rejected": "rejected answer (required)"
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

### KTO Dataset

An additional column `kto_tag` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

### Multimodal Image Dataset

An additional column `images` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

### Multimodal Video Dataset

An additional column `videos` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

### Multimodal Audio Dataset

An additional column `audios` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

## Sharegpt Format

### Supervised Fine-Tuning Dataset

- [Example dataset](glaive_toolcall_en_demo.json)

Compared to the alpaca format, the sharegpt format allows the dataset to have **more roles**, such as human, gpt, observation and function. They are presented in a list of objects in the `conversations` column.

Note that the human and observation messages should appear in odd positions, while the gpt and function messages should appear in even positions.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "function_call",
        "value": "tool arguments"
      },
      {
        "from": "observation",
        "value": "tool result"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)",
    "tools": "tool description (optional)"
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
  }
}
```

### Pre-training Dataset

Not yet supported, please use the [alpaca](#alpaca-format) format.

### Preference Dataset

- [Example dataset](dpo_en_demo.json)

Preference datasets in sharegpt format also require a better message in the `chosen` column and a worse message in the `rejected` column.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      {
        "from": "human",
        "value": "user instruction"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "chosen answer (required)"
    },
    "rejected": {
      "from": "gpt",
      "value": "rejected answer (required)"
    }
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "ranking": true,
  "columns": {
    "messages": "conversations",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

### KTO Dataset

- [Example dataset](kto_en_demo.json)

KTO datasets require an extra `kto_tag` column containing the boolean human feedback.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "kto_tag": "human feedback [true/false] (required)"
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "kto_tag": "kto_tag"
  }
}
```

### Multimodal Image Dataset

- [Example dataset](mllm_demo.json)

Multimodal image datasets require an `images` column containing the paths to the input images.

The number of images should be identical to the number of `<image>` tokens in the conversations.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "<image>user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "images": [
      "image path (required)"
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "images": "images"
  }
}
```

### Multimodal Video Dataset

- [Example dataset](mllm_video_demo.json)

Multimodal video datasets require a `videos` column containing the paths to the input videos.

The number of videos should be identical to the number of `<video>` tokens in the conversations.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "<video>user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "videos": [
      "video path (required)"
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "videos": "videos"
  }
}
```

### Multimodal Audio Dataset

- [Example dataset](mllm_audio_demo.json)

Multimodal audio datasets require an `audios` column containing the paths to the input audio files.

The number of audio files should be identical to the number of `<audio>` tokens in the conversations.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "<audio>user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "audios": [
      "audio path (required)"
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "audios": "audios"
  }
}
```

### OpenAI Format

The OpenAI format is simply a special case of the sharegpt format, where the first message may be a system prompt.

```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": "system prompt (optional)"
      },
      {
        "role": "user",
        "content": "user instruction"
      },
      {
        "role": "assistant",
        "content": "model response"
      }
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
}
```