"docs/source/api/vscode:/vscode.git/clone" did not exist on "2288098ba667b244c94913bde0e9b07182a475c4"
bitsandbytes.md 16.2 KB
Newer Older
Aryan's avatar
Aryan committed
1
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-->

# bitsandbytes

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.

20
21
22
23
24
25
This guide demonstrates how quantization can enable running
[FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)
on less than 16GB of VRAM and even on a free Google
Colab instance.

![comparison image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/comparison.png)
26
27
28
29
30
31
32
33
34
35
36
37
38
39

To use bitsandbytes, make sure you have the following libraries installed:

```bash
pip install diffusers transformers accelerate bitsandbytes -U
```

Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

<hfoptions id="bnb">
<hfoption id="8-bit">

Quantizing a model in 8-bit halves the memory-usage:

40
41
42
43
44
45
46
47
bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.

48
```py
49
50
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
51
import torch
52
from diffusers import AutoModel
53
from transformers import T5EncoderModel
54

55
56
57
58
59
60
61
62
63
64
65
quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True,)

text_encoder_2_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True,)

66
transformer_8bit = AutoModel.from_pretrained(
67
    "black-forest-labs/FLUX.1-dev",
68
    subfolder="transformer",
69
70
    quantization_config=quant_config,
    torch_dtype=torch.float16,
71
72
73
)
```

74
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
75

76
```diff
77
transformer_8bit = AutoModel.from_pretrained(
78
79
80
81
82
83
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
+   torch_dtype=torch.float32,
)
```
84

85
Let's generate an image using our quantized models.
86

87
88
89
90
Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the
CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.

```py
91
92
from diffusers import FluxPipeline

93
94
95
96
97
98
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_8bit,
    text_encoder_2=text_encoder_2_8bit,
    torch_dtype=torch.float16,
    device_map="auto",
99
)
100
101
102
103
104
105
106
107
108
109
110

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0]
111
112
```

113
114
115
116
117
118
119
<div class="flex justify-center">
   <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/8bit.png"/>
</div>

When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
120
121
122
123
124
125

</hfoption>
<hfoption id="4-bit">

Quantizing a model in 4-bit reduces your memory-usage by 4x:

126
127
128
129
130
131
132
133
bitsandbytes is supported in both Transformers and Diffusers, so you can can quantize both the
[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.

134
```py
135
136
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
137
import torch
138
from diffusers import AutoModel
139
from transformers import T5EncoderModel
140

141
142
143
144
145
146
147
148
149
150
151
quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True,)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,)

152
transformer_4bit = AutoModel.from_pretrained(
153
    "black-forest-labs/FLUX.1-dev",
154
    subfolder="transformer",
155
156
    quantization_config=quant_config,
    torch_dtype=torch.float16,
157
158
159
)
```

160
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
161

162
```diff
163
transformer_4bit = AutoModel.from_pretrained(
164
165
166
167
168
169
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
+   torch_dtype=torch.float32,
)
```
170

171
Let's generate an image using our quantized models.
172

173
174
175
Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.

```py
176
177
from diffusers import FluxPipeline

178
179
180
181
182
183
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,
    text_encoder_2=text_encoder_2_4bit,
    torch_dtype=torch.float16,
    device_map="auto",
184
)
185
186
187
188
189
190
191
192
193
194
195

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0]
196
197
```

198
199
200
201
202
203
204
<div class="flex justify-center">
   <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/4bit.png"/>
</div>

When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
205
206
207
208

</hfoption>
</hfoptions>

Steven Liu's avatar
Steven Liu committed
209
210
> [!WARNING]
> Training with 8-bit and 4-bit weights are only supported for training *extra* parameters.
211
212
213
214
215
216
217

Check your memory footprint with the `get_memory_footprint` method:

```py
print(model.get_memory_footprint())
```

218
219
Note that this only tells you the memory footprint of the model params and does _not_ estimate the inference memory requirements.

220
221
222
Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters:

```py
223
from diffusers import AutoModel, BitsAndBytesConfig
224
225
226

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

227
model_4bit = AutoModel.from_pretrained(
228
    "hf-internal-testing/flux.1-dev-nf4-pkg", subfolder="transformer"
229
230
231
232
233
)
```

## 8-bit (LLM.int8() algorithm)

Steven Liu's avatar
Steven Liu committed
234
235
> [!TIP]
> Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!
236
237
238
239
240
241
242
243
244
245

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.

### Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:

```py
246
from diffusers import AutoModel, BitsAndBytesConfig
247
248
249
250
251

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=10,
)

252
model_8bit = AutoModel.from_pretrained(
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

### Skip module conversion

For some models, you don't need to quantize every module to 8-bit which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_skip_modules=["proj_out"],
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```


## 4-bit (QLoRA algorithm)

Steven Liu's avatar
Steven Liu committed
280
281
> [!TIP]
> Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.


### Compute data type

To speedup computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:

```py
import torch
from diffusers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

```py
302
303
304
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

305
from diffusers import AutoModel
306
from transformers import T5EncoderModel
307

308
quant_config = TransformersBitsAndBytesConfig(
309
310
311
312
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

313
314
315
316
317
318
319
320
321
322
323
324
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

325
transformer_4bit = AutoModel.from_pretrained(
326
    "black-forest-labs/FLUX.1-dev",
327
    subfolder="transformer",
328
329
    quantization_config=quant_config,
    torch_dtype=torch.float16,
330
331
332
333
334
335
336
337
338
339
)
```

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and `torch_dtype` values.

### Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. 

```py
340
341
342
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

343
from diffusers import AutoModel
344
from transformers import T5EncoderModel
345

346
quant_config = TransformersBitsAndBytesConfig(
347
348
349
350
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

351
352
353
354
355
356
357
358
359
360
361
362
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

363
transformer_4bit = AutoModel.from_pretrained(
364
    "black-forest-labs/FLUX.1-dev",
365
    subfolder="transformer",
366
367
    quantization_config=quant_config,
    torch_dtype=torch.float16,
368
369
370
371
372
)
```

## Dequantizing `bitsandbytes` models

373
Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model. 
374
375

```python
376
377
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
378

379
from diffusers import AutoModel
380
381
382
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
383
384
385
386
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

387
388
389
390
391
392
393
394
395
396
397
398
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

399
transformer_4bit = AutoModel.from_pretrained(
400
    "black-forest-labs/FLUX.1-dev",
401
    subfolder="transformer",
402
403
    quantization_config=quant_config,
    torch_dtype=torch.float16,
404
)
405
406
407

text_encoder_2_4bit.dequantize()
transformer_4bit.dequantize()
408
409
```

410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
## torch.compile

Speed up inference with `torch.compile`. Make sure you have the latest `bitsandbytes` installed and we also recommend installing [PyTorch nightly](https://pytorch.org/get-started/locally/).

<hfoptions id="bnb">
<hfoption id="8-bit">
```py
torch._dynamo.config.capture_dynamic_output_shape_ops = True

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
transformer_4bit.compile(fullgraph=True)
```

</hfoption>
<hfoption id="4-bit">

```py
quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
transformer_4bit.compile(fullgraph=True)
```
</hfoption>
</hfoptions>

On an RTX 4090 with compilation, 4-bit Flux generation completed in 25.809 seconds versus 32.570 seconds without.

Check out the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) for more details.

449
450
451
## Resources

* [End-to-end notebook showing Flux.1 Dev inference in a free-tier Colab](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4)
452
* [Training](https://github.com/huggingface/diffusers/blob/8c661ea586bf11cb2440da740dd3c4cf84679b85/examples/dreambooth/README_hidream.md#using-quantization)