# 8-bit optimizers

With 8-bit optimizers, large models can be finetuned with 75% less GPU memory without losing any accuracy compared to training with standard 32-bit optimizers. The reduced memory requirements mean 8-bit optimizers are 4x faster than a standard optimizer, and no hyperparameter tuning is required.

This guide will show you how to use 8-bit optimizers.

> [!WARNING]
> 8-bit optimizers reduce memory usage and accelerate optimization on a wide range of tasks. However, since 8-bit optimizers only reduce memory proportional to the number of parameters, models that use large amounts of activation memory, such as convolutional networks, don't really benefit from 8-bit optimizers. 8-bit optimizers are most beneficial for training or finetuning models with many parameters on highly memory-constrained GPUs.
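
To get a sense of the scale of these savings, here is a rough back-of-the-envelope sketch (illustrative only, not part of the library): Adam keeps two state tensors per parameter, so 32-bit states cost about 8 bytes per parameter while 8-bit states cost about 2 bytes, plus small per-block quantization constants.

```py
# rough estimate of Adam optimizer-state memory; the 1B parameter count is a
# hypothetical example, and the small per-block quantization constants are ignored
num_params = 1_000_000_000

bytes_32bit = num_params * 2 * 4  # two fp32 state tensors (4 bytes each)
bytes_8bit = num_params * 2 * 1   # two int8 state tensors (1 byte each)

print(f"32-bit Adam states: {bytes_32bit / 1e9:.0f} GB")  # ~8 GB
print(f"8-bit Adam states:  {bytes_8bit / 1e9:.0f} GB")   # ~2 GB
```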

8-bit optimizers are a drop-in replacement for regular optimizers, which means they also accept the same arguments as a regular optimizer. For NLP models, it is recommended to use the [`~nn.StableEmbedding`] class to improve stability and results.

```diff
import bitsandbytes as bnb

- adam = torch.optim.Adam(...)
+ adam = bnb.optim.Adam8bit(...)

# recommended for NLP models
- embedding = torch.nn.Embedding(...)
+ embedding = bnb.nn.StableEmbedding(...)
```
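
Beyond the swap itself, nothing else changes: a complete (if artificial) training step looks exactly like it would with `torch.optim.Adam`. The toy model and data below are placeholders used only for illustration.

```py
import torch
import bitsandbytes as bnb

# hypothetical toy model and data, only to show that the training loop is unchanged
model = torch.nn.Linear(64, 2).cuda()
adam = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.995))

x = torch.randn(8, 64).cuda()
labels = torch.randint(0, 2, (8,)).cuda()

loss = torch.nn.functional.cross_entropy(model(x), labels)
loss.backward()
adam.step()       # same step/zero_grad API as torch.optim.Adam
adam.zero_grad()
```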

By default, all parameter tensors with fewer than 4096 elements are kept in 32-bit even if you initialize those parameters with 8-bit optimizers. This is done because small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm).

You can change this value with the `min_8bit_size` parameter. For example, to use 8-bit optimizer states only for parameter tensors with at least 16384 elements (it is recommended to use multiples of 4096):

```py
import bitsandbytes as bnb

adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
```

Other parameters you can configure include the learning rate (`lr`), the decay rates (`betas`), the number of bits of the optimizer state (`optim_bits`), and percentile clipping (`percentile_clipping`), which can increase stability. For example, to initialize a 32-bit [`~bitsandbytes.optim.Adam`] optimizer with 5th percentile clipping:

```py
import bitsandbytes as bnb

adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
```

## Optimize unstable parameters

To optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, use the [`~bitsandbytes.optim.GlobalOptimManager`] class to override the specific hyperparameters for a particular layer. You'll need to:

1. Register the parameters while they're on the CPU.

```py
import torch
import bitsandbytes as bnb

mng = bnb.optim.GlobalOptimManager.get_instance()

model = MyModel()
mng.register_parameters(model.parameters())
```

2. Override the config with the new desired hyperparameters. For example, let's override the `model.fc1.weight` layer to use 32-bit Adam.

> [!TIP]
> Check the optimizer API documentation for more information about other hyperparameters you can override.

```py
model = model.cuda()
# use 8-bit optimizer states for all parameters
adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)

# override the config so model.fc1.weight now uses 32-bit Adam
mng.override_config(model.fc1.weight, "optim_bits", 32)
```

You can also override multiple layers at once by passing them as a list and the new hyperparameters as a dictionary. For example, let's override the `model.special.weight` and `model.also_special.weight` layers to use sparse optimization with a lower learning rate and decay rates.

```py
mng.override_config([model.special.weight, model.also_special.weight],
                    key_value_dict={'is_sparse': True, 'lr': 1e-5, 'betas': (0.9, 0.98)})
```

For a specific layer, we recommend overriding locally in each module. Pass the module, the parameter, and its attribute name to the [`~bitsandbytes.optim.GlobalOptimManager`]:

```py
import torch
import bitsandbytes as bnb

class MyModule(torch.nn.Module):
  def __init__(self, d_in, d_out):
    super(MyModule, self).__init__()
    self.linear = torch.nn.Linear(d_in, d_out)
    # optimization will happen in 32-bit and
    # the learning rate will be set to 0.0001 independent of the main learning rate
    config = {'optim_bits': 32, 'lr': 0.0001}
    # register the override on the submodule that owns the `weight` parameter
    bnb.optim.GlobalOptimManager.get_instance().register_module_override(self.linear, 'weight', config)
```
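
With the module defined this way, you create the optimizer as usual and the registered override is applied automatically by the bitsandbytes optimizer. The instantiation below is a hypothetical example using the `MyModule` class above.

```py
# a hypothetical instantiation of the MyModule class defined above
model = MyModule(4096, 4096).cuda()
adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001)
# model.linear.weight is optimized in 32-bit with lr=0.0001,
# independent of the lr and optimizer bits used for the other parameters
```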

## Next steps

For more conceptual details and explanation about 8-bit optimizers, take a look at the [8-bit optimizers](./explanations/optimizers) guide.