"examples/flax/vscode:/vscode.git/clone" did not exist on "a180efe7fdba14130357f728eac05e3b11d32bb0"
Unverified Commit 55db70c6 authored by Marc Sun, committed by GitHub

GPTQ integration (#25062)



* GPTQ integration

* Add tests for gptq

* support for more quantization model

* fix style

* typo

* fix method

* Update src/transformers/modeling_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add dataclass and fix quantization_method

* fix doc

* Update tests/quantization/gptq/test_gptq.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* modify dataclass

* add gptqconfig import

* fix typo

* fix tests

* remove dataset as req arg

* remove tokenizer import

* add offload cpu quantization test

* fix check dataset

* modify dockerfile

* protect trainer

* style

* test for config

* add more log

* overwrite torch_dtype

* draft doc

* modify quantization_config docstring

* fix class name in docstring

* Apply suggestions from code review
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* more warning

* fix 8bit kwargs tests

* peft compatibility

* remove var

* fix is_gptq_quantized

* remove is_gptq_quantized

* fix wrap

* Update src/transformers/modeling_utils.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* add exllama

* skip test

* overwrite float16

* style

* fix skip test

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix docsting formatting

* add doc

* better test

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
parent 34700123
......@@ -47,8 +47,11 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/acc
# Add bitsandbytes for mixed int8 testing
RUN python3 -m pip install --no-cache-dir bitsandbytes
# For bettertransformer
RUN python3 -m pip install --no-cache-dir optimum
# Add auto-gptq for gptq quantization testing
RUN python3 -m pip install --no-cache-dir auto-gptq
# For bettertransformer + gptq
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
# For video model testing
RUN python3 -m pip install --no-cache-dir decord av==9.2.0
......
......@@ -16,6 +16,137 @@ rendered properly in your Markdown viewer.
# Quantize 🤗 Transformers models
## `AutoGPTQ` Integration
🤗 Transformers has integrated the `optimum` API to perform GPTQ quantization on language models. You can load and quantize your model in 8, 6, 4, or even 2 bits without a big drop in performance and with faster inference speed! This is supported by most GPU hardware.
To learn more about the quantization method, check out:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
<!-- - the `optimum` [guide]() on GPTQ quantization -->
- the [`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
### Requirements
You need to have the following requirements installed to run the code below:
- Install the latest `AutoGPTQ` library:
`pip install auto-gptq`
- Install the latest `optimum` from source:
`pip install git+https://github.com/huggingface/optimum.git`
- Install the latest `transformers` from source:
`pip install git+https://github.com/huggingface/transformers.git`
- Install the latest `accelerate` library:
`pip install --upgrade accelerate`
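Optionally, you can sanity-check your environment before running the examples below. This is only a convenience sketch; it assumes the packages were installed under their usual pip distribution names (`auto-gptq`, `optimum`, `transformers`, `accelerate`).
```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed version of each requirement, or flag it as missing.
for package in ["auto-gptq", "optimum", "transformers", "accelerate"]:
    try:
        print(f"{package}: {version(package)}")
    except PackageNotFoundError:
        print(f"{package}: not installed")
```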
For now, GPTQ integration only supports text models; you may encounter unexpected behaviour with vision, speech, or multi-modal models.
### Load and quantize a model
GPTQ is a quantization method that requires weight calibration before using the quantized model. If you want to quantize a transformers model from scratch, it might take some time before producing the quantized model (~10 min on a Google Colab for the `facebook/opt-350m` model).
Hence, there are two different scenarios where you may want to use GPTQ-quantized models. The first use case is to load a model that has already been quantized by other users and is available on the Hub; the second is to quantize your own model from scratch and save it or push it to the Hub so that other users can also use it.
#### GPTQ Configuration
In order to load and quantize a model, you need to create a [`GPTQConfig`]. You need to pass the number of `bits`, a `dataset` used to calibrate the quantization, and the `tokenizer` of the model in order to prepare the dataset.
```python
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer)
```
Note that you can pass your own dataset as a list of strings. However, it is highly recommended to use one of the datasets from the GPTQ paper.
```python
dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
quantization = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
```
#### Quantization
You can quantize a model by using `from_pretrained` and setting the `quantization_config`.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config)
```
Note that you will need a GPU to quantize a model. We will put the model on the CPU and move the modules back and forth to the GPU in order to quantize them.
If you want to maximize your GPU usage while using CPU offload, you can set `device_map = "auto"`.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
```
Note that disk offload is not supported. Furthermore, if you run out of memory because of the dataset, you may have to pass `max_memory` in `from_pretrained`. Check out this [guide](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map) to learn more about `device_map` and `max_memory`.
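For example, here is a minimal sketch of capping per-device memory with `max_memory`; the budgets below are purely illustrative and `gptq_config` is the configuration created earlier.
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # illustrative budgets: at most 30GiB on GPU 0 and 60GiB of CPU RAM
    max_memory={0: "30GiB", "cpu": "60GiB"},
    quantization_config=gptq_config,
)
```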
<Tip warning={true}>
GPTQ quantization only works for text models for now. Furthermore, the quantization process can take a lot of time depending on one's hardware (175B model = 4 GPU-hours using an NVIDIA A100). Please check on the Hub whether a GPTQ-quantized version of the model already exists. If not, you can submit a request on GitHub.
</Tip>
### Push quantized model to 🤗 Hub
You can push the quantized model like any 🤗 model to the Hub with `push_to_hub`. The quantization config will be saved and pushed along with the model.
```python
quantized_model.push_to_hub("opt-125m-gptq")
tokenizer.push_to_hub("opt-125m-gptq")
```
If you want to save your quantized model on your local machine, you can also do it with `save_pretrained`:
```python
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```
Note that if you have quantized your model with a `device_map`, make sure to move the entire model to one of your GPUs or to the CPU before saving it.
```python
quantized_model.to("cpu")
quantized_model.save_pretrained("opt-125m-gptq")
```
### Load a quantized model from the 🤗 Hub
You can load a quantized model from the Hub by using `from_pretrained`.
Make sure that the pushed weights are quantized by checking that the attribute `quantization_config` is present in the model configuration object.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq")
```
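One quick way to verify this before downloading the weights is to inspect the configuration. This is just a small sketch, reusing the `{your_username}/opt-125m-gptq` placeholder from above.
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("{your_username}/opt-125m-gptq")
# A quantized checkpoint exposes its quantization settings in the config.
print(getattr(config, "quantization_config", None))
```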
If you want to load a model faster and without allocating more memory than needed, the `device_map` argument also works with quantized models. Make sure that you have the `accelerate` library installed.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
```
### Exllama kernels for faster inference
For 4-bit models, you can use the exllama kernels for faster inference speed. They are activated by default. You can change this behavior by passing `disable_exllama` in [`GPTQConfig`]. This will overwrite the quantization config stored in the model's config. Note that you can only overwrite the attributes related to the kernels. Furthermore, you need to have the entire model on GPUs if you want to use the exllama kernels.
```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, disable_exllama=False)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)
```
Note that only 4-bit models are supported for now. Furthermore, it is recommended to deactivate the exllama kernels if you are fine-tuning a quantized model with `peft`.
#### Fine-tune a quantized model
With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
Please have a look at the [`peft`](https://github.com/huggingface/peft) library for more details.
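For illustration only, here is a rough sketch of attaching LoRA adapters to a GPTQ-quantized model with `peft`; the repository name, target module names, and hyperparameters are placeholders that depend on your model, and this is not an official recipe.
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, GPTQConfig

# Deactivate the exllama kernels for fine-tuning, as recommended above.
model = AutoModelForCausalLM.from_pretrained(
    "{your_username}/opt-125m-gptq",
    device_map="auto",
    quantization_config=GPTQConfig(bits=4, disable_exllama=True),
)

# Illustrative LoRA hyperparameters; `target_modules` depends on the architecture.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```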
### Example demo
Check out the Google Colab [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) to learn how to quantize your model with GPTQ and how to fine-tune the quantized model with `peft`.
### GPTQConfig
[[autodoc]] GPTQConfig
## `bitsandbytes` Integration
🤗 Transformers is closely integrated with the most commonly used modules of `bitsandbytes`. You can load your model in 8-bit precision with a few lines of code.
......@@ -215,7 +346,7 @@ This section is intended to advanced users, that want to explore what it is poss
One of the advanced use case of this is being able to load a model and dispatch the weights between `CPU` and `GPU`. Note that the weights that will be dispatched on CPU **will not** be converted in 8-bit, thus kept in `float32`. This feature is intended for users that want to fit a very large model and dispatch the model between GPU and CPU.
First, load a `BitsAndBytesConfig` from `transformers` and set the attribute `llm_int8_enable_fp32_cpu_offload` to `True`:
First, load a [`BitsAndBytesConfig`] from `transformers` and set the attribute `llm_int8_enable_fp32_cpu_offload` to `True`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
......
......@@ -731,7 +731,7 @@ _import_structure = {
"logging",
],
"utils.bitsandbytes": [],
"utils.quantization_config": ["BitsAndBytesConfig"],
"utils.quantization_config": ["BitsAndBytesConfig", "GPTQConfig"],
}
# sentencepiece-backed objects
......@@ -4703,7 +4703,7 @@ if TYPE_CHECKING:
)
# bitsandbytes config
from .utils.quantization_config import BitsAndBytesConfig
from .utils.quantization_config import BitsAndBytesConfig, GPTQConfig
try:
if not is_sentencepiece_available():
......
......@@ -64,6 +64,7 @@ from .utils import (
download_url,
has_file,
is_accelerate_available,
is_auto_gptq_available,
is_bitsandbytes_available,
is_offline_mode,
is_optimum_available,
......@@ -75,7 +76,7 @@ from .utils import (
)
from .utils.hub import convert_file_size_to_int, get_checkpoint_shard_files
from .utils.import_utils import ENV_VARS_TRUE_VALUES, is_sagemaker_mp_enabled, is_torch_fx_proxy
from .utils.quantization_config import BitsAndBytesConfig
from .utils.quantization_config import BitsAndBytesConfig, GPTQConfig, QuantizationMethod
from .utils.versions import require_version_core
......@@ -1915,7 +1916,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
@wraps(torch.nn.Module.cuda)
def cuda(self, *args, **kwargs):
# Checks if the model has been loaded in 8-bit
if getattr(self, "is_quantized", False):
if getattr(self, "quantization_method", None) == QuantizationMethod.BITS_AND_BYTES:
raise ValueError(
"Calling `cuda()` is not supported for `4-bit` or `8-bit` quantized models. Please use the model as it is, since the"
" model has already been set to the correct devices and casted to the correct `dtype`."
......@@ -1926,29 +1927,29 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
@wraps(torch.nn.Module.to)
def to(self, *args, **kwargs):
# Checks if the model has been loaded in 8-bit
if getattr(self, "is_quantized", False):
if getattr(self, "quantization_method", None) == QuantizationMethod.BITS_AND_BYTES:
raise ValueError(
"`.to` is not supported for `4-bit` or `8-bit` models. Please use the model as it is, since the"
"`.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the"
" model has already been set to the correct devices and casted to the correct `dtype`."
)
else:
return super().to(*args, **kwargs)
def half(self, *args):
# Checks if the model has been loaded in 8-bit
# Checks if the model is quantized
if getattr(self, "is_quantized", False):
raise ValueError(
"`.half()` is not supported for `4-bit` or `8-bit` models. Please use the model as it is, since the"
"`.half()` is not supported for quantized model. Please use the model as it is, since the"
" model has already been casted to the correct `dtype`."
)
else:
return super().half(*args)
def float(self, *args):
# Checks if the model has been loaded in 8-bit
# Checks if the model is quantized
if getattr(self, "is_quantized", False):
raise ValueError(
"`.float()` is not supported for `4-bit` or `8-bit` models. Please use the model as it is, since the"
"`.float()` is not supported for quantized model. Please use the model as it is, since the"
" model has already been casted to the correct `dtype`."
)
else:
......@@ -2130,9 +2131,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
load_in_4bit (`bool`, *optional*, defaults to `False`):
If `True`, will convert the loaded model into 4bit precision quantized model. To use this feature
install the latest version of `bitsandbytes` (`pip install -U bitsandbytes`).
quantization_config (`Dict`, *optional*):
A dictionary of configuration parameters for the `bitsandbytes` library and loading the model using
advanced features such as offloading in fp32 on CPU or on disk.
quantization_config (`Union[QuantizationConfigMixin,Dict]`, *optional*):
A dictionary of configuration parameters or a QuantizationConfigMixin object for quantization (e.g.
bitsandbytes, gptq).
subfolder (`str`, *optional*, defaults to `""`):
In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
specify the folder name here.
......@@ -2287,13 +2288,20 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
"Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`"
)
if quantization_config is None:
quantization_method_from_args = None
if quantization_config is not None:
quantization_method_from_args = getattr(
quantization_config, "quant_method", QuantizationMethod.BITS_AND_BYTES
)
if quantization_config is None and (load_in_8bit or load_in_4bit):
quantization_method_from_args = QuantizationMethod.BITS_AND_BYTES
quantization_config, kwargs = BitsAndBytesConfig.from_dict(
config_dict={"load_in_8bit": load_in_8bit, "load_in_4bit": load_in_4bit},
return_unused_kwargs=True,
**kwargs,
)
elif quantization_config is not None:
elif quantization_method_from_args == QuantizationMethod.BITS_AND_BYTES:
load_in_8bit = quantization_config.load_in_8bit
load_in_4bit = quantization_config.load_in_4bit
......@@ -2375,15 +2383,63 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
else:
model_kwargs = kwargs
if is_8bit_serializable and quantization_config is not None and load_in_8bit:
quantizer = None
quantization_method_from_config = None
if hasattr(config, "quantization_config"):
quantization_method_from_config = config.quantization_config.get(
"quant_method", QuantizationMethod.BITS_AND_BYTES
)
if quantization_method_from_config == QuantizationMethod.GPTQ and quantization_method_from_args is not None:
loading_attr_dict = quantization_config.get_loading_attributes()
for attr, val in loading_attr_dict.items():
config.quantization_config[attr] = val
quantization_method_from_args = None
logger.warning(
"You passed `quantization_config` to `from_pretrained` but the model you're loading already has a "
"`quantization_config` attribute and has already quantized weights. However, loading attributes"
" (e.g. disable_exllama, use_cuda_fp16) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored."
)
if (
quantization_method_from_args == QuantizationMethod.GPTQ
or quantization_method_from_config == QuantizationMethod.GPTQ
):
if not torch.cuda.is_available():
raise RuntimeError("GPU is required to quantize or run quantize model.")
elif not (is_optimum_available() and is_auto_gptq_available()):
raise ImportError(
"Loading GPTQ quantized model requires optimum library : `pip install optimum` and auto-gptq library 'pip install auto-gptq'"
)
else:
# Need to protect the import
from optimum.gptq import GPTQQuantizer
if quantization_method_from_config == QuantizationMethod.GPTQ:
quantization_config = GPTQConfig.from_dict(config.quantization_config)
config.quantization_config = quantization_config
logger.info(
f"Overriding torch_dtype={torch_dtype} with `torch_dtype=torch.float16` due to "
"requirements of `auto-gptq` to enable model quantization "
)
torch_dtype = torch.float16
quantizer = GPTQQuantizer.from_dict(quantization_config.to_dict())
if (
is_8bit_serializable
and quantization_method_from_args == QuantizationMethod.BITS_AND_BYTES
and load_in_8bit
):
if quantization_method_from_config == QuantizationMethod.BITS_AND_BYTES:
logger.warning(
"You passed `quantization_config` to `from_pretrained` but the model you're loading already has a"
" `quantization_config` attribute. The `quantization_config` attribute will be overwritten with the"
" one you passed to `from_pretrained`."
)
config.quantization_config = quantization_config
elif is_8bit_serializable and not load_in_8bit and hasattr(config, "quantization_config"):
elif (
is_8bit_serializable
and not load_in_8bit
and quantization_method_from_config == QuantizationMethod.BITS_AND_BYTES
):
quantization_config = config.quantization_config
if isinstance(quantization_config, dict):
quantization_config = BitsAndBytesConfig.from_dict(quantization_config, return_unused_kwargs=False)
......@@ -2413,7 +2469,11 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
if low_cpu_mem_usage is None:
low_cpu_mem_usage = True
elif not is_8bit_serializable and not load_in_8bit and hasattr(config, "quantization_config"):
elif (
not is_8bit_serializable
and not load_in_8bit
and quantization_method_from_config == QuantizationMethod.BITS_AND_BYTES
):
logger.warning(
"Detected the presence of a `quantization_config` attribute in the model's configuration but you don't have the correct"
" `bitsandbytes` version to support int8 serialization. Please install the latest version of `bitsandbytes` with "
......@@ -2800,6 +2860,16 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
"All non-linear modules will be loaded in full precision."
" If you want to load the other modules in other precision, please specify a `torch_dtype` attribute."
)
if quantization_method_from_config == QuantizationMethod.GPTQ:
model = quantizer.convert_model(model)
model._is_quantized_training_enabled = True
if quantization_method_from_config is not None:
model.quantization_method = quantization_method_from_config
elif quantization_method_from_args is not None:
model.quantization_method = quantization_method_from_args
if hasattr(model, "quantization_method"):
model.is_quantized = True
if isinstance(device_map, str):
special_dtypes = {}
......@@ -2951,13 +3021,12 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
offload_folder=offload_folder,
offload_state_dict=offload_state_dict,
dtype=torch_dtype,
is_quantized=(load_in_8bit or load_in_4bit),
is_quantized=(getattr(model, "quantization_method", None) == QuantizationMethod.BITS_AND_BYTES),
keep_in_fp32_modules=keep_in_fp32_modules,
)
model.is_loaded_in_4bit = load_in_4bit
model.is_loaded_in_8bit = load_in_8bit
model.is_quantized = load_in_8bit or load_in_4bit
# make sure token embedding weights are still tied if needed
model.tie_weights()
......@@ -2999,6 +3068,17 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
device_map_kwargs["skip_keys"] = model._skip_keys_device_placement
dispatch_model(model, **device_map_kwargs)
if quantization_method_from_args == QuantizationMethod.GPTQ:
if quantization_config.tokenizer is None:
quantization_config.tokenizer = pretrained_model_name_or_path
if cls.main_input_name != "input_ids":
raise RuntimeError("We can only quantize pure text model.")
quantizer.quantize_model(model, quantization_config.tokenizer)
model.config.quantization_config = GPTQConfig.from_dict(quantizer.to_dict())
model._is_quantized_training_enabled = True
if quantization_method_from_config == QuantizationMethod.GPTQ:
model = quantizer.post_init_model(model)
if output_loading_info:
if loading_info is None:
loading_info = {
......
......@@ -475,6 +475,9 @@ class _BaseAutoModelClass:
# meaningless in the context of the config object - torch.dtype values are acceptable
if kwargs.get("torch_dtype", None) == "auto":
_ = kwargs.pop("torch_dtype")
# to not overwrite the quantization_config if config has a quantization_config
if kwargs.get("quantization_config", None) is not None:
_ = kwargs.pop("quantization_config")
config, kwargs = AutoConfig.from_pretrained(
pretrained_model_name_or_path,
......@@ -487,6 +490,8 @@ class _BaseAutoModelClass:
# if torch_dtype=auto was passed here, ensure to pass it on
if kwargs_orig.get("torch_dtype", None) == "auto":
kwargs["torch_dtype"] = "auto"
if kwargs_orig.get("quantization_config", None) is not None:
kwargs["quantization_config"] = kwargs_orig["quantization_config"]
has_remote_code = hasattr(config, "auto_map") and cls.__name__ in config.auto_map
has_local_code = type(config) in cls._model_mapping.keys()
......
......@@ -51,6 +51,7 @@ from .integrations import (
from .utils import (
is_accelerate_available,
is_apex_available,
is_auto_gptq_available,
is_bitsandbytes_available,
is_bs4_available,
is_cython_available,
......@@ -776,6 +777,13 @@ def require_optimum(test_case):
return unittest.skipUnless(is_optimum_available(), "test requires optimum")(test_case)
def require_auto_gptq(test_case):
"""
Decorator for auto_gptq dependency
"""
return unittest.skipUnless(is_auto_gptq_available(), "test requires auto-gptq")(test_case)
def require_phonemizer(test_case):
"""
Decorator marking a test that requires phonemizer
......
......@@ -146,6 +146,7 @@ from .utils import (
logging,
strtobool,
)
from .utils.quantization_config import QuantizationMethod
DEFAULT_CALLBACKS = [DefaultFlowCallback]
......@@ -396,10 +397,9 @@ class Trainer:
if getattr(model, "is_quantized", False):
if getattr(model, "_is_quantized_training_enabled", False):
logger.info(
"The model is loaded in 8-bit precision. To train this model you need to add additional modules"
"The model is quantized. To train this model you need to add additional modules"
" inside the model such as adapters using `peft` library and freeze the model weights. Please"
" check "
" the examples in https://github.com/huggingface/peft for more details."
" check the examples in https://github.com/huggingface/peft for more details."
)
else:
raise ValueError(
......@@ -498,8 +498,11 @@ class Trainer:
self.eval_dataset = eval_dataset
self.tokenizer = tokenizer
# Quantized models doesn't support `.to` operation.
if self.place_model_on_device and not getattr(model, "is_quantized", False):
# Bnb quantized models don't support the `.to` operation.
if (
self.place_model_on_device
and not getattr(model, "quantization_method", None) == QuantizationMethod.BITS_AND_BYTES
):
self._move_model_to_device(model, args.device)
# Force n_gpu to 1 to avoid DataParallel as MP will manage the GPUs
......
......@@ -104,6 +104,7 @@ from .import_utils import (
get_torch_version,
is_accelerate_available,
is_apex_available,
is_auto_gptq_available,
is_bitsandbytes_available,
is_bs4_available,
is_coloredlogs_available,
......
......@@ -98,6 +98,7 @@ _natten_available = _is_package_available("natten")
_onnx_available = _is_package_available("onnx")
_openai_available = _is_package_available("openai")
_optimum_available = _is_package_available("optimum")
_auto_gptq_available = _is_package_available("auto_gptq")
_pandas_available = _is_package_available("pandas")
_peft_available = _is_package_available("peft")
_phonemizer_available = _is_package_available("phonemizer")
......@@ -554,6 +555,10 @@ def is_optimum_available():
return _optimum_available
def is_auto_gptq_available():
return _auto_gptq_available
def is_optimum_neuron_available():
return _optimum_available and _is_package_available("optimum.neuron")
......
......@@ -19,7 +19,8 @@ import importlib.metadata
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, Union
from enum import Enum
from typing import Any, Dict, List, Optional, Union
from packaging import version
......@@ -33,8 +34,100 @@ if is_torch_available():
logger = logging.get_logger(__name__)
class QuantizationMethod(str, Enum):
BITS_AND_BYTES = "bitsandbytes"
GPTQ = "gptq"
@dataclass
class BitsAndBytesConfig:
class QuantizationConfigMixin:
"""
Mixin class for quantization config
"""
quant_method: QuantizationMethod
@classmethod
def from_dict(cls, config_dict, return_unused_kwargs=False, **kwargs):
"""
Instantiates a [`QuantizationConfigMixin`] from a Python dictionary of parameters.
Args:
config_dict (`Dict[str, Any]`):
Dictionary that will be used to instantiate the configuration object.
return_unused_kwargs (`bool`,*optional*, defaults to `False`):
Whether or not to return a list of unused keyword arguments. Used for `from_pretrained` method in
`PreTrainedModel`.
kwargs (`Dict[str, Any]`):
Additional parameters from which to initialize the configuration object.
Returns:
[`QuantizationConfigMixin`]: The configuration object instantiated from those parameters.
"""
config = cls(**config_dict)
to_remove = []
for key, value in kwargs.items():
if hasattr(config, key):
setattr(config, key, value)
to_remove.append(key)
for key in to_remove:
kwargs.pop(key, None)
if return_unused_kwargs:
return config, kwargs
else:
return config
def to_json_file(self, json_file_path: Union[str, os.PathLike]):
"""
Save this instance to a JSON file.
Args:
json_file_path (`str` or `os.PathLike`):
Path to the JSON file in which this configuration instance's parameters will be saved.
use_diff (`bool`, *optional*, defaults to `True`):
If set to `True`, only the difference between the config instance and the default
`QuantizationConfig()` is serialized to JSON file.
"""
with open(json_file_path, "w", encoding="utf-8") as writer:
config_dict = self.to_dict()
json_string = json.dumps(config_dict, indent=2, sort_keys=True) + "\n"
writer.write(json_string)
def to_dict(self) -> Dict[str, Any]:
"""
Serializes this instance to a Python dictionary. Returns:
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
"""
return copy.deepcopy(self.__dict__)
def __repr__(self):
return f"{self.__class__.__name__} {self.to_json_string()}"
def to_json_string(self, use_diff: bool = True) -> str:
"""
Serializes this instance to a JSON string.
Args:
use_diff (`bool`, *optional*, defaults to `True`):
If set to `True`, only the difference between the config instance and the default `PretrainedConfig()`
is serialized to JSON string.
Returns:
`str`: String containing all the attributes that make up this configuration instance in JSON format.
"""
if use_diff is True:
config_dict = self.to_diff_dict()
else:
config_dict = self.to_dict()
return json.dumps(config_dict, indent=2, sort_keys=True) + "\n"
@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
"""
This is a wrapper class about all possible attributes and features that you can play with a model that has been
loaded using `bitsandbytes`.
......@@ -97,6 +190,7 @@ class BitsAndBytesConfig:
bnb_4bit_use_double_quant=False,
**kwargs,
):
self.quant_method = QuantizationMethod.BITS_AND_BYTES
self.load_in_8bit = load_in_8bit
self.load_in_4bit = load_in_4bit
self.llm_int8_threshold = llm_int8_threshold
......@@ -168,88 +262,16 @@ class BitsAndBytesConfig:
else:
return None
@classmethod
def from_dict(cls, config_dict, return_unused_kwargs, **kwargs):
"""
Instantiates a [`BitsAndBytesConfig`] from a Python dictionary of parameters.
Args:
config_dict (`Dict[str, Any]`):
Dictionary that will be used to instantiate the configuration object.
return_unused_kwargs (`bool`):
Whether or not to return a list of unused keyword arguments. Used for `from_pretrained` method in
`PreTrainedModel`.
kwargs (`Dict[str, Any]`):
Additional parameters from which to initialize the configuration object.
Returns:
[`BitsAndBytesConfig`]: The configuration object instantiated from those parameters.
"""
config = cls(**config_dict)
to_remove = []
for key, value in kwargs.items():
if hasattr(config, key):
setattr(config, key, value)
to_remove.append(key)
for key in to_remove:
kwargs.pop(key, None)
if return_unused_kwargs:
return config, kwargs
else:
return config
def to_json_file(self, json_file_path: Union[str, os.PathLike]):
"""
Save this instance to a JSON file.
Args:
json_file_path (`str` or `os.PathLike`):
Path to the JSON file in which this configuration instance's parameters will be saved.
use_diff (`bool`, *optional*, defaults to `True`):
If set to `True`, only the difference between the config instance and the default
`BitsAndBytesConfig()` is serialized to JSON file.
"""
with open(json_file_path, "w", encoding="utf-8") as writer:
config_dict = self.to_dict()
json_string = json.dumps(config_dict, indent=2, sort_keys=True) + "\n"
writer.write(json_string)
def to_dict(self) -> Dict[str, Any]:
"""
Serializes this instance to a Python dictionary. Returns:
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["bnb_4bit_compute_dtype"] = str(output["bnb_4bit_compute_dtype"]).split(".")[1]
return output
def __repr__(self):
return f"{self.__class__.__name__} {self.to_json_string()}"
def to_json_string(self, use_diff: bool = True) -> str:
"""
Serializes this instance to a JSON string.
Args:
use_diff (`bool`, *optional*, defaults to `True`):
If set to `True`, only the difference between the config instance and the default `PretrainedConfig()`
is serialized to JSON string.
Returns:
`str`: String containing all the attributes that make up this configuration instance in JSON format.
"""
if use_diff is True:
config_dict = self.to_diff_dict()
else:
config_dict = self.to_dict()
return json.dumps(config_dict, indent=2, sort_keys=True) + "\n"
def to_diff_dict(self) -> Dict[str, Any]:
"""
Removes all attributes from config which correspond to the default config attributes for better readability and
......@@ -271,3 +293,119 @@ class BitsAndBytesConfig:
serializable_config_dict[key] = value
return serializable_config_dict
@dataclass
class GPTQConfig(QuantizationConfigMixin):
"""
This is a wrapper class about all possible attributes and features that you can play with a model that has been
loaded using the `optimum` API for GPTQ quantization, relying on the auto_gptq backend.
Args:
bits (`int`):
The number of bits to quantize to, supported numbers are (2, 3, 4, 8).
tokenizer (`str` or `PreTrainedTokenizerBase`, *optional*):
The tokenizer used to process the dataset. You can pass either:
- A custom tokenizer object.
- A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co.
Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
user or organization name, like `dbmdz/bert-base-german-cased`.
- A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`.
dataset (`Union[List[str]]`, *optional*):
The dataset used for quantization. You can provide your own dataset as a list of strings, or just use one of
the original datasets used in the GPTQ paper: ['wikitext2','c4','c4-new','ptb','ptb-new'].
group_size (`int`, *optional*, defaults to 128):
The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
damp_percent (`float`, *optional*, defaults to 0.01):
The percent of the average Hessian diagonal to use for dampening. Recommended value is 0.01.
desc_act (`bool`, *optional*, defaults to `True`):
Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly
speed up inference but the perplexity may become slightly worse. Also known as act-order.
sym (`bool`, *optional*, defaults to `True`):
Whether to use symmetric quantization.
true_sequential (`bool`, *optional*, defaults to `True`):
Whether to perform sequential quantization even within a single Transformer block. Instead of quantizing
the entire block at once, we perform layer-wise quantization. As a result, each layer undergoes
quantization using inputs that have passed through the previously quantized layers.
use_cuda_fp16 (`bool`, *optional*, defaults to `False`):
Whether or not to use optimized cuda kernel for fp16 model. Need to have model in fp16.
model_seqlen (`int`, *optional*):
The maximum sequence length that the model can take.
block_name_to_quantize (`str`, *optional*):
The transformers block name to quantize.
module_name_preceding_first_block (`List[str]`, *optional*):
The layers that are preceding the first Transformer block.
batch_size (`int`, *optional*, defaults to 1):
The batch size used when processing the dataset
pad_token_id (`int`, *optional*):
The pad token id. Needed to prepare the dataset when `batch_size` > 1.
disable_exllama (`bool`, *optional*, defaults to `False`):
Whether to use exllama backend. Only works with `bits` = 4.
"""
def __init__(
self,
bits: int,
tokenizer: Any = None,
dataset: Optional[Union[List[str], str]] = None,
group_size: int = 128,
damp_percent: float = 0.01,
desc_act: bool = True,
sym: bool = True,
true_sequential: bool = True,
use_cuda_fp16: bool = False,
model_seqlen: Optional[int] = None,
block_name_to_quantize: Optional[str] = None,
module_name_preceding_first_block: Optional[List[str]] = None,
batch_size: int = 1,
pad_token_id: Optional[int] = None,
disable_exllama: bool = False,
**kwargs,
):
self.quant_method = QuantizationMethod.GPTQ
self.bits = bits
self.tokenizer = tokenizer
self.dataset = dataset
self.group_size = group_size
self.damp_percent = damp_percent
self.desc_act = desc_act
self.sym = sym
self.true_sequential = true_sequential
self.use_cuda_fp16 = use_cuda_fp16
self.model_seqlen = model_seqlen
self.block_name_to_quantize = block_name_to_quantize
self.module_name_preceding_first_block = module_name_preceding_first_block
self.batch_size = batch_size
self.pad_token_id = pad_token_id
self.disable_exllama = disable_exllama
self.post_init()
def get_loading_attributes(self):
attibutes_dict = copy.deepcopy(self.__dict__)
loading_attibutes = ["disable_exllama", "use_cuda_fp16"]
loading_attibutes_dict = {i: j for i, j in attibutes_dict.items() if i in loading_attibutes}
return loading_attibutes_dict
def post_init(self):
r"""
Safety checker that arguments are correct
"""
if self.bits not in [2, 4, 6, 8]:
raise ValueError(f"Only support quantization to [2,4,6,8] bits but found {self.bits}")
if self.group_size != -1 and self.group_size <= 0:
raise ValueError("group_size must be greater than 0 or equal to -1")
if not (0 < self.damp_percent < 1):
raise ValueError("damp_percent must between 0 and 1.")
if self.dataset is not None:
if isinstance(self.dataset, str):
if self.dataset not in ["wikitext2", "c4", "c4-new", "ptb", "ptb-new"]:
raise ValueError(
f"""You have entered a string value for dataset. You can only choose between
['wikitext2','c4','c4-new','ptb','ptb-new'], but we found {self.dataset}"""
)
elif not isinstance(self.dataset, list):
raise ValueError(
f"""dataset needs to be either a list of string or a value in
['wikitext2','c4','c4-new','ptb','ptb-new'], but we found {self.dataset}"""
)
......@@ -753,6 +753,7 @@ class MixedInt8TestCpuGpu(BaseMixedInt8Test):
model_8bit = AutoModelForCausalLM.from_pretrained(
self.model_name,
device_map=device_map,
load_in_8bit=True,
llm_int8_enable_fp32_cpu_offload=True,
offload_folder=tmpdirname,
)
......
# coding=utf-8
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tempfile
import unittest
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
from transformers.testing_utils import (
is_torch_available,
require_accelerate,
require_auto_gptq,
require_optimum,
require_torch_gpu,
require_torch_multi_gpu,
slow,
)
if is_torch_available():
import torch
class GPTQConfigTest(unittest.TestCase):
def test_bits(self):
with self.assertRaises(ValueError):
GPTQConfig(bits="")
GPTQConfig(bits=1)
GPTQConfig(bits=2)
GPTQConfig(bits=4)
def test_dataset(self):
with self.assertRaises(ValueError):
GPTQConfig(bits=2, dataset="auto_gpt")
GPTQConfig(bits=2, dataset="c4")
GPTQConfig(bits=2, dataset="ptb-new")
def test_damp_percent(self):
with self.assertRaises(ValueError):
GPTQConfig(bits=2, damp_percent=10)
GPTQConfig(bits=2, damp_percent=-1)
GPTQConfig(bits=2, damp_percent="0")
GPTQConfig(bits=2, damp_percent=0.01)
def test_to_dict(self):
quantization_config = GPTQConfig(bits=2)
quantization_config.to_dict()
def test_from_dict(self):
dict = {"bits": 2}
quantization_config = GPTQConfig.from_dict(dict)
self.assertEqual(dict["bits"], quantization_config.bits)
@require_optimum
def test_optimum_config(self):
from optimum.gptq import GPTQQuantizer
config = GPTQConfig(bits=2)
optimum_config = GPTQQuantizer.from_dict(config.to_dict())
self.assertEqual(optimum_config.bits, config.bits)
new_config = GPTQConfig.from_dict(optimum_config.to_dict())
self.assertEqual(optimum_config.bits, new_config.bits)
@slow
@require_optimum
@require_auto_gptq
@require_torch_gpu
class GPTQTest(unittest.TestCase):
model_name = "bigscience/bloom-560m"
input_text = "Hello my name is"
EXPECTED_OUTPUTS = set()
EXPECTED_OUTPUTS.add("Hello my name is John and I am a professional photographer. I")
EXPECTED_OUTPUTS.add("Hello my name is John and I am a very good looking man.")
EXPECTED_OUTPUTS.add("Hello my name is Alyson and I am a professional photographer")
# this seems a little small considering that we are doing 4bit quant but we have a small model and we don't quantize the embeddings
EXPECTED_RELATIVE_DIFFERENCE = 1.664253062
bits = 4
group_size = 128
desc_act = False
disable_exllama = True
dataset = [
"auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
]
device_map = None
# called only once for all test in this class
@classmethod
def setUpClass(cls):
"""
Setup quantized model
"""
cls.model_fp16 = AutoModelForCausalLM.from_pretrained(
cls.model_name, torch_dtype=torch.float16, device_map=cls.device_map
)
cls.mem_fp16 = cls.model_fp16.get_memory_footprint()
cls.tokenizer = AutoTokenizer.from_pretrained(cls.model_name, use_fast=True)
quantization_config = GPTQConfig(
bits=cls.bits,
dataset=cls.dataset,
tokenizer=cls.tokenizer,
group_size=cls.group_size,
desc_act=cls.desc_act,
disable_exllama=cls.disable_exllama,
)
cls.quantized_model = AutoModelForCausalLM.from_pretrained(
cls.model_name,
torch_dtype=torch.float16,
device_map=cls.device_map,
quantization_config=quantization_config,
)
def test_memory_footprint(self):
r"""
A simple test to check if the model conversion has been done correctly by checking on the
memory footprint of the converted model
"""
mem_quantized = self.quantized_model.get_memory_footprint()
self.assertAlmostEqual(self.mem_fp16 / mem_quantized, self.EXPECTED_RELATIVE_DIFFERENCE)
def test_quantized_layers_class(self):
"""
Simple test to check if the model conversion has been done correctly by checking on
the class type of the linear layers of the converted models
"""
from auto_gptq.utils.import_utils import dynamically_import_QuantLinear
QuantLinear = dynamically_import_QuantLinear(
use_triton=False,
desc_act=self.desc_act,
group_size=self.group_size,
bits=self.bits,
disable_exllama=self.disable_exllama,
)
self.assertTrue(self.quantized_model.transformer.h[0].mlp.dense_4h_to_h.__class__ == QuantLinear)
def check_inference_correctness(self, model):
r"""
Test the generation quality of the quantized model and see that we are matching the expected output.
Given that we are operating on small numbers + the testing model is relatively small, we might not get
the same output across GPUs. So we'll generate a few tokens (5-10) and check their output.
"""
# Check that inference pass works on the model
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
# Check the exactness of the results
output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
# Get the generation
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
def test_generate_quality(self):
"""
Simple test to check the quality of the model by comparing the generated tokens with the expected tokens
"""
if self.device_map is None:
self.check_inference_correctness(self.quantized_model.to(0))
else:
self.check_inference_correctness(self.quantized_model)
def test_serialization(self):
"""
Test the serialization of the model and the loading of the quantized weights works
"""
with tempfile.TemporaryDirectory() as tmpdirname:
self.quantized_model.save_pretrained(tmpdirname)
if self.disable_exllama:
quantized_model_from_saved = AutoModelForCausalLM.from_pretrained(tmpdirname).to(0)
else:
# we need to put it directly to the gpu. Otherwise, we won't be able to initialize the exllama kernel
quantized_model_from_saved = AutoModelForCausalLM.from_pretrained(tmpdirname, device_map={"": 0})
self.check_inference_correctness(quantized_model_from_saved)
@require_accelerate
def test_serialization_big_model_inference(self):
"""
Test the serialization of the model and the loading of the quantized weights with big model inference
"""
with tempfile.TemporaryDirectory() as tmpdirname:
self.quantized_model.save_pretrained(tmpdirname)
quantized_model_from_saved = AutoModelForCausalLM.from_pretrained(tmpdirname, device_map="auto")
self.check_inference_correctness(quantized_model_from_saved)
def test_change_loading_attributes(self):
"""
Test the serialization of the model and the loading of the quantized weights works with another config file
"""
with tempfile.TemporaryDirectory() as tmpdirname:
self.quantized_model.save_pretrained(tmpdirname)
if self.disable_exllama:
self.assertEqual(self.quantized_model.config.quantization_config.disable_exllama, True)
# we need to put it directly to the gpu. Otherwise, we won't be able to initialize the exllama kernel
quantized_model_from_saved = AutoModelForCausalLM.from_pretrained(
tmpdirname, quantization_config=GPTQConfig(disable_exllama=False, bits=6), device_map={"": 0}
)
self.assertEqual(quantized_model_from_saved.config.quantization_config.disable_exllama, False)
self.assertEqual(quantized_model_from_saved.config.quantization_config.bits, self.bits)
self.check_inference_correctness(quantized_model_from_saved)
@require_accelerate
@require_torch_multi_gpu
class GPTQTestDeviceMap(GPTQTest):
device_map = "auto"
@require_accelerate
@require_torch_multi_gpu
class GPTQTestDeviceMapExllama(GPTQTest):
device_map = "auto"
disable_exllama = False
# fail when run all together
@pytest.mark.skip
@require_accelerate
@require_torch_multi_gpu
class GPTQTestDeviceMapCPUOffload(GPTQTest):
device_map = {
"transformer.word_embeddings": 0,
"transformer.word_embeddings_layernorm": 0,
"lm_head": 0,
"transformer.h.0": 0,
"transformer.h.1": 0,
"transformer.h.2": 0,
"transformer.h.3": 0,
"transformer.h.4": 0,
"transformer.h.5": 0,
"transformer.h.6": 0,
"transformer.h.7": 0,
"transformer.h.8": 0,
"transformer.h.9": 0,
"transformer.h.10": 1,
"transformer.h.11": 1,
"transformer.h.12": 1,
"transformer.h.13": 1,
"transformer.h.14": 1,
"transformer.h.15": 1,
"transformer.h.16": 1,
"transformer.h.17": 0,
"transformer.h.18": "cpu",
"transformer.h.19": "cpu",
"transformer.h.20": "cpu",
"transformer.h.21": "cpu",
"transformer.h.22": "cpu",
"transformer.h.23": 1,
"transformer.ln_f": 0,
}