index.qmd 17.3 KB
Newer Older
chenzk's avatar
v1.0  
chenzk committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
---
title: Dataset Formats
description: Guide to Dataset Formats in Axolotl
back-to-top-navigation: true
toc: true
toc-depth: 5
---


Axolotl is a training framework that aims to make the process convenient yet flexible to users by simply passing a config yaml file.

As there are a lot of available options in Axolotl, this guide aims to provide an simplify the user experience to choosing the proper choice.

Axolotl supports 3 kinds of training methods: pre-training, supervised fine-tuning, and preference-based post-training (e.g. DPO, ORPO, PRMs). Each method has their own dataset format which are described below.

::: {.callout-tip}

This guide will mainly use JSONL as an introduction. Please refer to the [dataset loading docs](../dataset_loading.qmd) to understand how to load datasets from other sources.

For `pretraining_dataset:` specifically, please refer to the [Pre-training section](#pre-training).
:::

## Pre-training

When aiming to train on large corpora of text datasets, pre-training is your go-to choice. Due to the size of these datasets, downloading the entire-datasets before beginning training would be prohibitively time-consuming. Axolotl supports [streaming](https://huggingface.co/docs/datasets/en/stream) to only load batches into memory at a time.

A sample format for a pre-training dataset is as follows:

```json
{"text": "first row"}
{"text": "second row"}
...
```

It is typically recommended to save your dataset as `.jsonl` due to its flexibility and simplicity.

Axolotl supports loading from a Hugging Face hub repo or from local files.

### Pre-training from Hugging Face hub datasets

As an example, to train using a Hugging Face dataset `hf_org/name`, you can pass the following config:

```yaml
pretraining_dataset: hf_org/name
```

### Pre-training from local dataset files

Given a few corpus files: `A.jsonl`, `B.jsonl`, and `C.jsonl`, your config will look like the below:

```yaml
pretraining_dataset:
  - path: json
    data_files:
      - A.jsonl
      - B.jsonl
      - C.jsonl
```

While we recommend `.jsonl`, you can also use the other formats (`csv`, `parquet`, `arrow`, `SQL`, `Webdataset`) that are supported by [`Dataset.load_dataset`](https://huggingface.co/docs/datasets/loading#local-and-remote-files)

### Pre-training without streaming

On the rare case that the dataset is small and can be loaded entirely into memory, another approach to running pre-training is to use the `completion` format. This would mean that the entire dataset is pre-tokenized instead of on-demand in streaming.

One benefit of this is that the tokenization can be performed separately on a CPU-only machine, and then transferred to a GPU machine for training to save costs.

From Hugging Face:

```yaml
datasets:
  - path: hf_org/name
    type: completion
```

From local files:

```yaml
datasets:
  - path: A.jsonl
    type: completion

  - path: B.jsonl
    type: completion
```

::: {.callout-important}
For `completion` only, Axolotl would split texts if it exceeds the context length into multiple smaller prompts. If you are interested in having this for `pretraining_dataset` too, please let us know or help make a PR!
:::

### Pre-training dataset configuration tips

#### Setting max_steps

When using streaming for large datasets, Axolotl does not know in advance how large the dataset is and does not know when to stop.

Therefore, it is necessary to set `max_steps: int` in your config for pre-training to run, so that Axolotl knows when to stop training.

One step is equal to `sequence_len * micro_batch_size * gradient_accumulation_steps * total_num_gpus` tokens.

#### Group_by_length

It is recommended to leave this off if downloading from Hugging Face hub as it would download the entire dataset which can be very large.

### Reference

Please see docs [here](pretraining.qmd).

## Supervised fine-tuning (SFT)

Supervised fine-tuning is the process of training models to respond to an instruction or chat input.

As there are a wide variety of dataset formats, Axolotl tries to support a majority of the formats available in public datasets.

Axolotl provides four approaches for loading datasets, however, it's easier to work backwards from the dataset you have available to figure out which approach to use.

A flow chart is as follows:

1. Do you already have the dataset tokenized? If yes, check [Pre-Tokenized Dataset](#pre-tokenized-dataset).

2. Do you want to format the dataset yourself and manually choose each section to mask? If yes, check [Template Free Dataset](#template-free-dataset)

3. Is your dataset in a "conversation" format, containing a `list[messages]`? If yes, check [Conversation Dataset](#conversation-dataset)

4. Is your dataset in an "instruct" format, containing `{ instruction, response }`? If yes, check [Instruction Dataset](#instruction-dataset)

If you went through the flow chart and did not find one that matches, it is recommended to preprocess your dataset into one of the above or create a thread on Github Discussion.

::: {.callout-tip}
You can mix and match within each approach or across approaches to train a model on a variety of datasets.
:::

### Pre-Tokenized Dataset

We suggest this approach when you want to bring your own tokenized dataset.

Axolotl expects the dataset to have three keys:

- `input_ids`: from tokenizing formatted prompt
- `attention_mask`: for masking padding. If you don't add padding, it would be equal to `len(input_ids) * [1]`
- `labels`: this is the same as `input_ids`, however, if you want to mask certain tokens, you would set those indices to `-100`.

::: {.callout-tip}
Make sure to add BOS/EOS tokens to your prompt and mask it appropriately.
:::

A config for this would look like:

```yaml
datasets:
  - path: A.jsonl
    type:
```

::: {.callout-note}
`type: ` is empty!
:::

Reference: [Pre-Tokenized Dataset Documentation](tokenized.qmd).

### Template Free Dataset

We reccomend this approach when you want granular control over the prompt formatting, special tokens, and masking, whilst letting Axolotl handle the tokenization. This is very useful if your dataset has unique prompts that differ across samples and where one single general template wouldn't suffice.

In the example below, you could see that there is no proper structure. At the same time, it's very flexible as there are no constraints on how your prompt can look.

```json
{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```

Each prompt must be have a key called `segments` which is a list of `{ text, label }`.

```yaml
datasets:
  - path: A.jsonl
    type: input_output
```

Reference: [Template Free Documentation](template_free.qmd).

### Conversation Dataset

`conversation` messages are a list of messages which usually contain a `role` and `content` key.

::: {.callout-tip}
Fun fact: Axolotl synonymously refers to "chat" messages as `conversation` messages due to how FastChat initially used this term to build a widely used [fastchat conversation](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py) method for formatting chat messages prior to the creation of `chat_templates`.
:::

#### What are `chat_templates`?

The current most popular and convenient method for inference is to use `chat_templates` for formatting prompts. Axolotl supports using `chat_templates` for training to ensure that the model performs in the same environment as in inference.

Here's a quick rundown on `chat_template`: A `chat_template` is a Jinja2 template which formats a list of messages into a prompt.

An example of a prompt formatted into a popular template called ChatML can be seen below:

Single prompt (pretty-printed):
```json
{
    "messages": [
        {
            "role": "user",
            "content": "Hi"
        },
        {
            "role": "assistant",
            "content": "How can I help you?"
        },
        {
            "role": "user",
            "content": "Can you add 3+5?"
        },
        {
            "role": "assistant",
            "content": "The answer is 8."
        }
    ]
}
```

The ChatML template is as follows:
```jinja2
{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
```

The above prompt formatted into this template will result in:

```
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
How can I help you?<|im_end|>
<|im_start|>user
Can you add 3+5?<|im_end|>
<|im_start|>assistant
The answer is 8.<|im_end|>
```

By using delimiters (`<|im_start|>` and `<|im_end|>`), a prompt separates different speakers which helps the model identify which portion belongs to whom.

#### Common Conversation Dataset formats

Older conversation datasets with the following format are colloquially called `sharegpt` datasets.

```json
{"conversations": [{"from": "...", "value": "..."}]}
```

Newer conversation datasets usually follow the OpenAI format.

```json
{"messages": [{"role": "...", "content": "..."}]}
```

Axolotl supports both as well as allowing customization of any kind of key.

#### Chat Template Usage

To properly use this method, it is important to identify three things:

1. Which `chat_template` would you use?

2. What are the keys in your dataset, and what are the possible roles? For example, in OpenAI format, the keys would be `messages`, `role`, and `content`, respectively, whereas the possible roles are `system`, `user`, and `assistant`.

3. What do you want to mask? For instance, only assistant messages, only last message, or nothing.

##### Choosing a `chat_template`

There are a lot of `chat_templates` out there. Axolotl supports the common ones: [supported chat templates](https://github.com/axolotl-ai-cloud/axolotl/blob/860609392184cf62a7e0ca676658b170e059ce6c/src/axolotl/utils/chat_templates.py#L17). For example, to use ChatML, it would be `chat_template: chatml`.

However, it is also possible to use the already configured template within the tokenizer by specifying `chat_template: tokenizer_default`. If you want a fallback (in case some tokenizer does not have it pre-configured), you can do `chat_template: tokenizer_default_fallback_chatml` to fallback to the ChatML template if a tokenizer template was not found.

One last but powerful approach is to bring your own template. This can be set via:

```yaml
chat_template_jinja: # your template
```

##### Setting `chat_template` dataset keys

We currently default to OpenAI format for dataset keys, so if that's your current dataset format, there's nothing to do here.

If your dataset format is different, here are the keys you should check (with their defaults):

```yaml
datasets:
    ...
    field_messages: messages  # this should point to the key containing the list of conversations
    message_property_mappings:  # this is a mapping from keys in your dataset to keys in chat_template
      role: role
      content: content
```

In some `chat_templates` (e.g. [Gemma](https://huggingface.co/google/gemma-2b-it/blob/main/tokenizer_config.json#L1507)), the roles are hardcoded to `user` and `assistant`. Consequently, you may find it necessary to map the roles in your dataset to these above. We currently have some defaults that should work for common datasets, but if you get a `KeyError`, it would be necessary to add mapping for your roles. Here is an example of how it would look like:

```yaml
datasets:
    ...
    roles:
      assistant:
        - gpt
        - model
      user:
        - human
```

In the example above, all `gpt` and `model` values are converted to `assistant`. All `human` values are converted to `user.`

##### Handling masking

The common use case for `chat_template` is for chat messages, therefore, it is common to mask all non-assistant messages. Assistant messages refer to the bot messages that you want the model to learn on.

To train on all `assistant` messages, you would set the following configs.

```yaml
datasets:
    ...
    roles_to_train: ["assistant"]
    train_on_eos: "turn"
```

The `train_on_eos` config means that it would mask all EOS tokens for turns that aren't assistant-turns. The other options are: `all` and `last` to choose which EOS to train on.

Perhaps, you want to train on `assistant` and `narrator` roles, you can simply add `narrator` to the list of `roles_to_train`. You would also need to add it to the mapping of `roles` above.

```yaml
datasets:
    ...
    roles_to_train: ["assistant", "narrator"]
    roles:
      assistant:
        - gpt
        - model
      user:
        - human
      narrator: ["narrator"]
```

::: {.callout-tip}
As chat_templates may use hardcoded EOS/EOT tokens that are different from the tokenizer's EOS, it is highly recommended to set them. For example, `ChatML` uses `<|im_end|>` to end turns.

```yaml
special_tokens:
  eos_token: <|im_end|>
```

:::

##### Applying `chat_template`

Once all the above steps are completed, you could combine all these configs together to form a bespoke configuration for your custom dataset.

```yaml
datasets:
  - path: A.jsonl
    type: chat_template

    # step 1
    chat_template: chatml

    # step 2
    field_messages: messages
    message_property_mappings:
      role: role
      content: content

    roles:
      assistant:
        - gpt
        - model
        - assistant
      user:
        - human
        - user

    # step 3
    roles_to_train: ["assistant"]
    train_on_eos: "turn"

special_tokens:
  eos_token: <|im_end|>
```

If this config were to be applied to the sample dataset above, the output would look as such (which can be retrieved via `axolotl preprocess config.yaml --debug`):

```
<|im_start|>(-100, 128256) user(-100, 882)
(-100, 198) Hi(-100, 13347) <|im_end|>(-100, 128257)
(-100, 198) <|im_start|>(-100, 128256) assistant(-100, 78191)
(-100, 198) How(4438, 4438)  can(649, 649)  I(358, 358)  help(1520, 1520)  you(499, 499) ?(30, 30) <|im_end|>(128257, 128257)
(-100, 198) <|im_start|>(-100, 128256) user(-100, 882)
(-100, 198) Can(-100, 6854)  you(-100, 499)  add(-100, 923)  (-100, 220) 3(-100, 18) +(-100, 10) 5(-100, 20) ?(-100, 30) <|im_end|>(-100, 128257)
(-100, 198) <|im_start|>(-100, 128256) assistant(-100, 78191)
(-100, 198) The(791, 791)  answer(4320, 4320)  is(374, 374)  (220, 220) 8(23, 23) .(13, 13) <|im_end|>(128257, 128257)
(-100, 198)
```

The first number refers to the label, the second refers to the `token_id`. For example, `-100` labels appear on non-assistant portions, meaning that they are masked during. For assistant portions, the label is the same as the `token_id`.

::: {.callout-note}

If during `preprocess`, there are a lot of warnings of `Could not find content __ boundary`, please check the FAQ section for [chat_templates](../faq.qmd#chat-templates).

:::

#### Reference

Please see docs [here](conversation.qmd).

### Instruction Dataset

Instruction datasets are used to train instruction-following models and comprise a prompt, containing an instruction, and a single response. In contrast to chat datasets which may be multi-turn, instruct datasets are typically single-turn.

An example is of a common format called Alpaca:
```json
{"instruction": "...", "input": "...", "output": "..."}
```

Using those keys, a prompt can be built based on it.
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```

This can be configured as such:
```yaml
datasets:
  - path: A.jsonl
    type: alpaca
```

Axolotl supports many kinds of instruction dataset. All of them can be found in the [Instruction Dataset Documentation](inst_tune.qmd) with their respective type and sample row format.

#### Custom Instruct Prompt Format

Due to the myriad possibilities of instruction formats, Axolotl allows customizing your own instruction format without having to dive into the code directly.

In the example below, a sample row is used to output in `mistral_v1` format.
```json
{"input": "...", "output": "..."}
```

```yaml
datasets:
  - path: repo
    type:
      system_prompt: ""

      field_system:
      field_instruction: input
      field_input:
      field_output: output

      # multi-line example with input
      format: |-
        [INST] {instruction} {input} [/INST]

      # single-line example without input
      no_input_format: "[INST] {instruction} [/INST]"
```

The config sets that the `field_instruction` is actually named `input`, and the `field_input` is empty as we don't have an `input` in this sample. Generally, `instruction` can be thought as the question to the model, and `input` as the additional information with `output` being the response. It is not necessary to have an `input` nor `system`. In the end, the most important part is to understand what format you want it to look like and how you can customize this to your use case.

Reference: [Custom Instruct Prompt Format Documentation](inst_tune.qmd#how-to-add-custom-prompt-format).

## Reinforcement Learning from Human Feedback (RLHF)

As there are multiple RLHF methods with their own dataset requirements. Please see [RLHF documentation](../rlhf.qmd) for more detail.