---
title: MultiModal / Vision Language Models (BETA)
format:
  html:
    toc: true
    toc-depth: 3
---

## Supported Models

- [Mllama](#sec-mllama)
- [Llama4](#sec-llama4)
- [Pixtral](#sec-pixtral)
- [Llava-1.5](#sec-llava-15)
- [Mistral-Small-3.1](#sec-mistral-small-31)
- [Gemma-3](#sec-gemma-3)
- [Qwen2-VL](#sec-qwen2-vl)
- [Qwen2.5-VL](#sec-qwen25-vl)

## Usage

Multimodal support is currently in beta and does not yet have full feature parity with text-only fine-tuning.

The following hyperparameters are needed to fine-tune a multimodal model:

```yaml
processor_type: AutoProcessor

skip_prepare_dataset: true
remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see the model-specific sections below

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

# (optional) if doing lora, only finetune the Language model,
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```

Please see the [examples](https://github.com/axolotl-ai/axolotl/tree/main/examples) folder for full configs.
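
Once you have assembled a config (saved here, for illustration, as `multimodal.yaml`), training is launched the same way as for text-only models, e.g. `axolotl train multimodal.yaml`.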

::: {.callout-warning}
Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.
:::

### Mllama {#sec-mllama}

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct

chat_template: llama3_2_vision
```

### Llama4 {#sec-llama4}

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct

chat_template: llama4
```

### Pixtral {#sec-pixtral}

```yaml
base_model: mistralai/Pixtral-12B-2409

chat_template: pixtral
```

### Llava-1.5 {#sec-llava-15}

```yaml
base_model: llava-hf/llava-1.5-7b-hf

chat_template: llava
```

### Mistral-Small-3.1 {#sec-mistral-small-31}

```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503

chat_template: mistral_v7_tekken
```

### Gemma-3 {#sec-gemma-3}

::: {.callout-tip}
The Gemma3-1B model is a text-only model, so please train it as a regular text model.
:::

For the multimodal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it

chat_template: gemma3
```

### Qwen2-VL {#sec-qwen2-vl}

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct

chat_template: qwen2_vl
```

### Qwen2.5-VL {#sec-qwen25-vl}

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl
```

## Dataset Format

For multi-modal datasets, we adopt an extended `chat_template` format similar to OpenAI's Messages format.

- Each message has a `role` and a `content` field.
- `role` can be `system`, `user`, `assistant`, etc.
- `content` is a list of parts, each with a `type` and a corresponding value (`text`, `image`, `path`, `url`, or `base64`).

::: {.callout-note}
For backwards compatibility:

- If the dataset has an `images` or `image` column of `list[Image]`, the images are appended to the first `content` list as `{"type": "image", "image": ...}`. However, if that content already contains a `{"type": "image"}` entry without an `image` key, the `image` key is set on that entry instead.
- If `content` is a string, it is converted to a list containing a single `{"type": "text", ...}` entry.
:::
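
As an illustration of the backwards-compatible form, a legacy-style row might pair a plain-string `content` with a separate `images` column (the image value is shown as a placeholder string here; in a real dataset it would be a `PIL.Image` object):

```json
{
  "messages": [
    {"role": "user", "content": "What is shown in this picture?"}
  ],
  "images": ["<PIL.Image>"]
}
```

This is treated as if `content` were a list of parts, with the image appended as `{"type": "image", "image": ...}`.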

::: {.callout-tip}
For image loading, you can use the following keys within `content` alongside `"type": "image"`:

- `"path": "/path/to/image.jpg"`
- `"url": "https://example.com/image.jpg"`
- `"base64": "..."`
- `"image": PIL.Image`
:::

Here is an example of a multi-modal dataset:
```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```
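
Local or embedded images follow the same structure: the `url` part above could instead use a `path` or `base64` key, for example (the file path below is only a placeholder):

```json
{
  "role": "user",
  "content": [
    {"type": "image", "path": "/data/images/bee.jpg"},
    {"type": "text", "text": "Describe this image in detail."}
  ]
}
```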