Unverified Commit 972fdcc7 authored by Younes Belkada, committed by GitHub

[`Docs`/`quantization`] Clearer explanation on how things work under the hood. + remove outdated info (#25216)

* clearer explanation on how things work under the hood.

* Update docs/source/en/main_classes/quantization.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/main_classes/quantization.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add `load_in_4bit` in `from_pretrained`

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
parent 77c3973e
@@ -29,6 +29,29 @@ If you want to quantize your own pytorch model, check out this [documentation](h
Here are the things you can do using `bitsandbytes` integration
### General usage
You can quantize a model by using the `load_in_8bit` or `load_in_4bit` argument when calling the [`~PreTrainedModel.from_pretrained`] method as long as your model supports loading with 🤗 Accelerate and contains `torch.nn.Linear` layers. This should work for any modality as well.
```python
from transformers import AutoModelForCausalLM
model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)
```
By default, all other modules (e.g. `torch.nn.LayerNorm`) are converted to `torch.float16`; if you want them in a different `dtype`, pass it via the `torch_dtype` argument:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM
>>> model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32)
>>> model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
torch.float32
```
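As a rough way to see the effect, here is a minimal sketch (assuming `bitsandbytes` is installed and there is enough GPU memory to load both variants) that compares the size reported by [`~PreTrainedModel.get_memory_footprint`]:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the same checkpoint in fp16 and in 8bit to compare sizes.
model_fp16 = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16)
model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True)

# get_memory_footprint returns the size of the parameters (and buffers) in bytes;
# the 8bit model should come out at roughly half the size of the fp16 one.
print(model_fp16.get_memory_footprint())
print(model_8bit.get_memory_footprint())
```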
### FP4 quantization
#### Requirements
@@ -41,7 +64,7 @@ Make sure that you have installed the requirements below before running any of t
- Install latest `accelerate`
`pip install --upgrade accelerate`
-- Install latest `transformers` from source
+- Install latest `transformers`
`pip install --upgrade transformers`
#### Tips and best practices
...
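If a 4bit or 8bit load fails, it is usually an environment issue. A purely illustrative check that the packages listed above are importable and recent could look like this:

```python
# Sanity-check the quantization dependencies before loading a model in 4bit/8bit.
import accelerate
import bitsandbytes
import transformers

print("transformers :", transformers.__version__)
print("accelerate   :", accelerate.__version__)
print("bitsandbytes :", bitsandbytes.__version__)
```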
@@ -2126,10 +2126,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
    `True` when there is some disk offload.
load_in_8bit (`bool`, *optional*, defaults to `False`):
    If `True`, will convert the loaded model into mixed-8bit quantized model. To use this feature please
-    install `bitsandbytes` compiled with your CUDA version by running `pip install -i
-    https://test.pypi.org/simple/ bitsandbytes-cudaXXX` where XXX is your CUDA version (e.g. 11.6 = 116).
-    Make also sure that you have enough GPU RAM to store half of the model size since the 8bit modules are
-    not compiled and adapted for CPUs.
+    install `bitsandbytes` (`pip install -U bitsandbytes`).
+load_in_4bit (`bool`, *optional*, defaults to `False`):
+    If `True`, will convert the loaded model into 4bit precision quantized model. To use this feature
+    install the latest version of `bitsandbytes` (`pip install -U bitsandbytes`).
quantization_config (`Dict`, *optional*):
    A dictionary of configuration parameters for the `bitsandbytes` library and loading the model using
    advanced features such as offloading in fp32 on CPU or on disk.
...
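For the advanced options hinted at by `quantization_config`, the boolean shortcuts can be replaced by an explicit [`BitsAndBytesConfig`]. The parameter values below are an illustrative sketch, not settings taken from this PR:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Spell out the 4bit settings instead of passing load_in_4bit=True directly.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # "fp4" (default) or "nf4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmul compute
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,
)
```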