---
title: Conversation
description: Conversation format for supervised fine-tuning.
order: 3
---

## chat_template

The chat template strategy uses a Jinja2 template to convert a list of messages into a prompt. It supports the tokenizer's built-in template, one of the supported named templates (e.g. `chatml`, `gemma`), or a custom Jinja2 template.

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "content": "..."}]}
```
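
A minimal dataset config is sketched below (`chatml` is just one example of a named template; omitting `chat_template` uses the tokenizer's own template):

```yaml
chat_template: chatml  # optional; omit to use the tokenizer's built-in template
datasets:
  - path: ...
    type: chat_template
```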

See the [config documentation](../config.qmd) for the full set of options and supported templates.

### Migrating from sharegpt

Most configs can be adapted as follows:

```yaml
# old
chat_template: chatml
datasets:
  - path: ...
    type: sharegpt
    conversation: chatml

# new (if using tokenizer's chat_template)
datasets:
  - path: ...
    type: chat_template

    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

# new (if setting a new chat_template like chatml, gemma, etc)
chat_template: chatml
datasets:
  - path: ...
    type: chat_template

    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
```

We recommend checking the examples below for other use cases.

### Examples

1. (Legacy) Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only the last message.

```yaml
datasets:
  - path: ...
    type: chat_template
    # leaving both of these empty (null) triggers the legacy behavior of
    # training on only the last message
    roles_to_train:
    train_on_eos:
```

::: {.callout-tip}
If you receive an error like "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null.", it means the tokenizer does not have a default `chat_template`. Follow the examples below instead to set a custom `chat_template`.
:::

2. Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.

```yaml
chat_template: gemma # this overwrites the tokenizer's chat_template
datasets:
  - path: ...
    type: chat_template
    roles_to_train: ["assistant"]  # default value
```

3. Using the tokenizer_config.json's chat template, falling back to `chatml` if the tokenizer does not define one, on OpenAI messages format, training on all assistant messages.

```yaml
chat_template: tokenizer_default_fallback_chatml # uses the tokenizer's chat_template, falling back to chatml if the tokenizer does not define one
datasets:
  - path: ...
    type: chat_template
```

4. Using a custom jinja template on OpenAI messages format, training on all assistant messages.

```yaml
# chat_template: jinja # `jinja` is implied if `chat_template_jinja` is set and this field is empty
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}"

datasets:
  - path: ...
    type: chat_template
```

::: {.callout-important}
Please make sure that your `tokenizer.eos_token` is the same as the EOS (End-of-Sequence) token in the template. If they differ, set `eos_token` under `special_tokens:`, as shown after this note.
:::
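
For instance, the custom template above ends each turn with `<|end|>`, so the EOS token could be aligned like this (a sketch; adjust the token to match your template):

```yaml
special_tokens:
  eos_token: "<|end|>"
```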

5. If you are using a template whose EOT (End-of-Turn) token differs from its EOS token, or which uses multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens:` config. The handling of EOT tokens follows `train_on_eot:`, whose default is read from `train_on_eos:` (which itself defaults to `turn`).

```yaml
eot_tokens:
  - "[/INST]"
  # - "[/SYSTEM_PROMPT]"

datasets:
  - path: ...
    type: chat_template

    # optional
    train_on_eot: turn  # defaults read from train_on_eos (which defaults to turn)
```

::: {.callout-tip}
See [config documentation](../config.qmd) for detailed explanations of "turn", "last", and "all" options for training on tokens.
:::

::: {.callout-note}
Using `eot_tokens` requires each of those tokens to be encoded as a single token by the tokenizer. Otherwise, the tokenizer will split them into multiple tokens and cause unexpected behavior.

You can add those tokens as new tokens under `tokens:` or (recommended) override unused added_tokens via `added_tokens_overrides:`, as sketched after this note. See [config](../config.qmd) for more details.
:::
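
As a sketch, the two approaches look like the following (the token ID `32000` is a placeholder; use the ID of an actual unused added token in your tokenizer):

```yaml
# option 1: add brand-new tokens (resizes the embedding layer)
tokens:
  - "[/INST]"

# option 2 (recommended): repurpose an existing unused added token
added_tokens_overrides:
  32000: "[/INST]"
```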

6. Continuing from the previous example, if you want to train on the EOT tokens of all trainable turns but only on the last EOS token, set `train_on_eos: last`.

```yaml
eot_tokens:
  - "[/INST]"
  # ...

datasets:
  - path: ...
    type: chat_template

    train_on_eos: last
    train_on_eot: turn
```

::: {.callout-tip}
If the EOS token only appears at the end of a prompt, `train_on_eos: last` is equivalent to `train_on_eos: turn`. In that case, you can generally leave both options at their defaults and omit them.
:::


7. (Advanced) Using fine-grained control over which tokens and turns to train on within a conversation.

For a data sample that looks like:

```{.json filename="data.jsonl"}
{
  "conversations": [
    {"from": "system", "value": "You are an AI assistant.", "train": false},
    {"from": "human", "value": "Hello", "train": false},
    {"from": "assistant", "value": "Hello", "train": true},
    {"from": "human", "value": "How are you?", "train": true},
    {
      "from": "assistant",
      "value": "I'm doing very well, thank you!",
      "train_detail": [
        {"begin_offset": 0, "end_offset": 8, "train": false},
        {"begin_offset": 9, "end_offset": 18, "train": true},
        {"begin_offset": 19, "end_offset": 30, "train": false},
      ],
    },
    {
        "from": "human",
        "value": "I'm doing very well, thank you!",
        "train": true,
    },
    {"from": "assistant", "value": "Hi there!", "train": true}
  ]
}
```

The configuration would look like:

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: tokenizer_default
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles_to_train: []
    train_on_eos: turn
    message_field_training: train
    message_field_training_detail: train_detail
```
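
With this configuration, role-based selection is disabled (`roles_to_train: []`), so a message is only trained on when its `train` field is `true`, and `train_detail` offsets further restrict training to the listed character ranges within a message.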

::: {.callout-tip}
It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
:::

8. (For the Qwen3 template only) Enabling the reasoning split, where the reasoning is separated from the content and passed to the template as a distinct field.

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: qwen3
    split_thinking: true
```

For example, a message's content might look like:

```json
{
  "content": "<think>Some thinking outputs</think>Output after thinking."
}
```

After the split, it will look like:

```json
{
  "reasoning_content": "Some thinking outputs",
  "content": "Output after thinking..."
}
```


## sharegpt

::: {.callout-important}
ShareGPT is deprecated! Please see the [chat_template](#chat_template) section instead.
:::

## pygmalion

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```
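
This format is used with the `pygmalion` dataset type, e.g.:

```yaml
datasets:
  - path: ...
    type: pygmalion
```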