bitsandbytes.md 15 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-->

# bitsandbytes

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.

20
21
22
23
24
25
This guide demonstrates how quantization can enable running
[FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)
on less than 16GB of VRAM and even on a free Google
Colab instance.

![comparison image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/comparison.png)
26
27
28
29
30
31
32
33
34
35
36
37
38
39

To use bitsandbytes, make sure you have the following libraries installed:

```bash
pip install diffusers transformers accelerate bitsandbytes -U
```

Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

<hfoptions id="bnb">
<hfoption id="8-bit">

Quantizing a model in 8-bit halves the memory-usage:

40
41
42
43
44
45
46
47
bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.

48
```py
49
50
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
51
import torch
52
from diffusers import AutoModel
53
from transformers import T5EncoderModel
54

55
56
57
58
59
60
61
62
63
64
65
quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True,)

text_encoder_2_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True,)

66
transformer_8bit = AutoModel.from_pretrained(
67
    "black-forest-labs/FLUX.1-dev",
68
    subfolder="transformer",
69
70
    quantization_config=quant_config,
    torch_dtype=torch.float16,
71
72
73
)
```

74
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
75

76
```diff
77
transformer_8bit = AutoModel.from_pretrained(
78
79
80
81
82
83
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
+   torch_dtype=torch.float32,
)
```
84

85
Let's generate an image using our quantized models.
86

87
88
89
90
Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the
CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.

```py
91
92
from diffusers import FluxPipeline

93
94
95
96
97
98
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_8bit,
    text_encoder_2=text_encoder_2_8bit,
    torch_dtype=torch.float16,
    device_map="auto",
99
)
100
101
102
103
104
105
106
107
108
109
110

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0]
111
112
```

113
114
115
116
117
118
119
<div class="flex justify-center">
   <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/8bit.png"/>
</div>

When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
120
121
122
123
124
125

</hfoption>
<hfoption id="4-bit">

Quantizing a model in 4-bit reduces your memory-usage by 4x:

126
127
128
129
130
131
132
133
bitsandbytes is supported in both Transformers and Diffusers, so you can can quantize both the
[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.

134
```py
135
136
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
137
import torch
138
from diffusers import AutoModel
139
from transformers import T5EncoderModel
140

141
142
143
144
145
146
147
148
149
150
151
quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True,)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,)

152
transformer_4bit = AutoModel.from_pretrained(
153
    "black-forest-labs/FLUX.1-dev",
154
    subfolder="transformer",
155
156
    quantization_config=quant_config,
    torch_dtype=torch.float16,
157
158
159
)
```

160
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
161

162
```diff
163
transformer_4bit = AutoModel.from_pretrained(
164
165
166
167
168
169
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
+   torch_dtype=torch.float32,
)
```
170

171
Let's generate an image using our quantized models.
172

173
174
175
Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.

```py
176
177
from diffusers import FluxPipeline

178
179
180
181
182
183
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,
    text_encoder_2=text_encoder_2_4bit,
    torch_dtype=torch.float16,
    device_map="auto",
184
)
185
186
187
188
189
190
191
192
193
194
195

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0]
196
197
```

198
199
200
201
202
203
204
<div class="flex justify-center">
   <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/4bit.png"/>
</div>

When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220

</hfoption>
</hfoptions>

<Tip warning={true}>

Training with 8-bit and 4-bit weights are only supported for training *extra* parameters.

</Tip>

Check your memory footprint with the `get_memory_footprint` method:

```py
print(model.get_memory_footprint())
```

221
222
Note that this only tells you the memory footprint of the model params and does _not_ estimate the inference memory requirements.

223
224
225
Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters:

```py
226
from diffusers import AutoModel, BitsAndBytesConfig
227
228
229

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

230
model_4bit = AutoModel.from_pretrained(
231
    "hf-internal-testing/flux.1-dev-nf4-pkg", subfolder="transformer"
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
)
```

## 8-bit (LLM.int8() algorithm)

<Tip>

Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!

</Tip>

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.

### Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:

```py
252
from diffusers import AutoModel, BitsAndBytesConfig
253
254
255
256
257

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=10,
)

258
model_8bit = AutoModel.from_pretrained(
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

### Skip module conversion

For some models, you don't need to quantize every module to 8-bit which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_skip_modules=["proj_out"],
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```


## 4-bit (QLoRA algorithm)

<Tip>

Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

</Tip>

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.


### Compute data type

To speedup computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:

```py
import torch
from diffusers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

```py
311
312
313
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

314
from diffusers import AutoModel
315
from transformers import T5EncoderModel
316

317
quant_config = TransformersBitsAndBytesConfig(
318
319
320
321
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

322
323
324
325
326
327
328
329
330
331
332
333
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

334
transformer_4bit = AutoModel.from_pretrained(
335
    "black-forest-labs/FLUX.1-dev",
336
    subfolder="transformer",
337
338
    quantization_config=quant_config,
    torch_dtype=torch.float16,
339
340
341
342
343
344
345
346
347
348
)
```

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and `torch_dtype` values.

### Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. 

```py
349
350
351
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

352
from diffusers import AutoModel
353
from transformers import T5EncoderModel
354

355
quant_config = TransformersBitsAndBytesConfig(
356
357
358
359
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

360
361
362
363
364
365
366
367
368
369
370
371
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

372
transformer_4bit = AutoModel.from_pretrained(
373
    "black-forest-labs/FLUX.1-dev",
374
    subfolder="transformer",
375
376
    quantization_config=quant_config,
    torch_dtype=torch.float16,
377
378
379
380
381
)
```

## Dequantizing `bitsandbytes` models

382
Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model. 
383
384

```python
385
386
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
387

388
from diffusers import AutoModel
389
390
391
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
392
393
394
395
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

396
397
398
399
400
401
402
403
404
405
406
407
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

408
transformer_4bit = AutoModel.from_pretrained(
409
    "black-forest-labs/FLUX.1-dev",
410
    subfolder="transformer",
411
412
    quantization_config=quant_config,
    torch_dtype=torch.float16,
413
)
414
415
416

text_encoder_2_4bit.dequantize()
transformer_4bit.dequantize()
417
418
419
420
421
```

## Resources

* [End-to-end notebook showing Flux.1 Dev inference in a free-tier Colab](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4)
422
* [Training](https://github.com/huggingface/diffusers/blob/8c661ea586bf11cb2440da740dd3c4cf84679b85/examples/dreambooth/README_hidream.md#using-quantization)