# Transformers

With Transformers, it's very easy to load any model in 4-bit or 8-bit precision, quantizing it on the fly with `bitsandbytes` primitives.

Please review the [`bitsandbytes` section in the Transformers docs](https://huggingface.co/docs/transformers/main/en/quantization#bitsandbytes).

Details about `BitsAndBytesConfig` can be found [here](https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/quantization#transformers.BitsAndBytesConfig).

> [!WARNING]
> **Beware: bf16 is the optimal compute data type!**
>
> If your hardware supports it, `bf16` is the optimal compute dtype. The default is `float32` for backward compatibility and numerical stability, and `float16` often leads to numerical instabilities. `bfloat16` provides the best of both worlds: numerical stability comparable to `float32` combined with the memory footprint and significant computation speedup of a 16-bit data type. Check whether your hardware supports `bf16` and, if so, configure it via the `bnb_4bit_compute_dtype` parameter in `BitsAndBytesConfig`:

```py
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```
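
The resulting config can be passed straight to `from_pretrained`. A minimal sketch (the model id below is only an illustrative example):

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

# The weights are quantized to 4-bit on the fly while loading; any Hub model id works here.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # example model id, substitute your own
    quantization_config=quantization_config,
    device_map="auto",
)
```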

# PEFT

With `PEFT`, you can use QLoRA out of the box with `LoraConfig` and a 4-bit base model.

Please review the [bitsandbytes section in the PEFT docs](https://huggingface.co/docs/peft/developer_guides/quantization#quantize-a-model).
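
As a rough sketch of the QLoRA recipe (the model id and LoRA hyperparameters below are illustrative assumptions, not recommendations):

```py
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit ...
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # example model id
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)

# ... then prepare it for k-bit training and attach trainable LoRA adapters on top
# of the frozen, quantized weights.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```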

# Accelerate

Bitsandbytes is also easily usable from within Accelerate: you can quantize any PyTorch model simply by passing a quantization config, for example:

```py
import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from mingpt.model import GPT

model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

# Build the model skeleton without allocating memory for its weights.
with init_empty_weights():
    empty_model = GPT(model_config)

bnb_quantization_config = BnbQuantizationConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # optional
    bnb_4bit_use_double_quant=True,         # optional
    bnb_4bit_quant_type="nf4",              # optional
)

# `weights_location` should point to the saved checkpoint of the model.
quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,
    bnb_quantization_config=bnb_quantization_config,
    device_map="auto",
)
```

For further details, e.g. model saving, CPU offloading and fine-tuning, please review the [`bitsandbytes` section in the Accelerate docs](https://huggingface.co/docs/accelerate/en/usage_guides/quantization).



# PyTorch Lightning and Lightning Fabric

Bitsandbytes is available from within both:
- [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), a deep learning framework for professional AI researchers and machine learning engineers who need maximal flexibility without sacrificing performance at scale;
- and [Lightning Fabric](https://lightning.ai/docs/fabric/stable/), a fast and lightweight way to scale PyTorch models without boilerplate.

Please review the [bitsandbytes section in the PyTorch Lightning docs](https://lightning.ai/docs/pytorch/stable/common/precision_intermediate.html#quantization-via-bitsandbytes).
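
For example, with Lightning Fabric the quantization plugin can be enabled roughly as follows (a minimal sketch, assuming a Lightning version that ships `BitsandbytesPrecision`; the same plugin can also be passed to the PyTorch Lightning `Trainer` via its `plugins` argument):

```py
import torch
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision

# Quantize nn.Linear layers to NF4 and run compute in bfloat16.
precision = BitsandbytesPrecision(mode="nf4", dtype=torch.bfloat16)
fabric = Fabric(plugins=precision)
fabric.launch()

# Models passed through `fabric.setup(model)` will then have their Linear layers
# replaced with bitsandbytes 4-bit layers.
```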


# Lit-GPT

Bitsandbytes is integrated into [Lit-GPT](https://github.com/Lightning-AI/lit-gpt), a hackable implementation of state-of-the-art open-source large language models, based on Lightning Fabric, where it can be used for quantization during training, finetuning, and inference.

Please review the [bitsandbytes section in the Lit-GPT quantization docs](https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/quantize.md).



# Trainer for the optimizers

You can use any of the 8-bit and/or paged optimizers by simply passing them to the `transformers.Trainer` class on initialization. All bnb optimizers are supported by passing the corresponding string to the `optim` argument of `TrainingArguments`, e.g. `paged_adamw_32bit`.

See the [official API docs for reference](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer).
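
A minimal sketch of selecting the paged 32-bit AdamW optimizer this way (the output directory is just a placeholder):

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",       # placeholder output directory
    optim="paged_adamw_32bit",  # any supported bitsandbytes optimizer string works here
)

# Pass `training_args` to `transformers.Trainer` as usual; the Trainer instantiates
# the bitsandbytes optimizer internally.
```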


# Blog posts

- [Making LLMs even more accessible with `bitsandbytes`, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
- [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and `bitsandbytes`](https://huggingface.co/blog/hf-bitsandbytes-integration)