"...git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "eed46f38b77b83cdd3df4c20a1a12e1d20a31dac"
Unverified Commit 06a1d75b authored by Marc Sun, committed by GitHub

fix gptq nits (#25500)

* fix nits

* fix docstring

* fix doc

* fix damp_percent

* fix doc
parent 80f29a25
@@ -18,11 +18,11 @@ rendered properly in your Markdown viewer.
 ## `AutoGPTQ` Integration
-🤗 Transformers has integrated `optimum` API to perform GPTQ quantization on language models. You can load and quantize your model in 8,6,4 or even 2 bits without a big drop of performance and faster inference speed! This is supported by most GPU hardwares.
+🤗 Transformers has integrated `optimum` API to perform GPTQ quantization on language models. You can load and quantize your model in 8, 4, 3 or even 2 bits without a big drop of performance and faster inference speed! This is supported by most GPU hardwares.
 To learn more about the the quantization model, check out:
 - the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
-<!-- - the `optimum` [guide]() on GPTQ quantization -->
+- the `optimum` [guide](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization) on GPTQ quantization
 - the [`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
 ### Requirements
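For readers following along, here is a minimal sketch of the first scenario this doc describes, loading a checkpoint that has already been GPTQ-quantized from the Hub. The model id is a hypothetical placeholder, and `optimum`, `auto-gptq` and `accelerate` are assumed to be installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id: any GPTQ-quantized checkpoint on the Hub is loaded the same way.
model_id = "someone/opt-350m-gptq"

# The quantization config stored with the checkpoint is picked up automatically;
# device_map="auto" dispatches the quantized weights onto the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("GPTQ quantization lets you run", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```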
@@ -40,11 +40,12 @@ You need to have the following requirements installed to run the code below:
 - Install latest `accelerate` library
 `pip install --upgrade accelerate`
-GPTQ integration supports for now only text models and you may encounter unexpected behaviour for vision, speech or multi-modal models.
+Note that GPTQ integration supports for now only text models and you may encounter unexpected behaviour for vision, speech or multi-modal models.
 ### Load and quantize a model
-GPTQ is a quantization method that requires weights calibration before using the quantized models. If you want to quantize transformers model from scratch, it might take some time before producing the quantized model (~10 min on a Google colab for `facebook/opt-350m` model.
+GPTQ is a quantization method that requires weights calibration before using the quantized models. If you want to quantize transformers model from scratch, it might take some time before producing the quantized model (~5 min on a Google colab for `facebook/opt-350m` model).
 Hence, there are two different scenarios where you want to use GPTQ-quantized models. The first use case would be to load models that has been already quantized by other users that are available on the Hub, the second use case would be to quantize your model from scratch and save it or push it on the Hub so that other users can also use it.
 #### GPTQ Configuration
......
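To illustrate the second scenario from the hunk above, quantizing a model from scratch, here is a minimal sketch. The calibration dataset and output directory are just examples, and `optimum`/`auto-gptq` must be installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit quantization with one of the calibration datasets listed in the GPTQConfig docstring.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Passing the config triggers calibration and quantization while the model loads
# (a few minutes for opt-350m on a free Colab GPU, per the doc above).
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)

# The quantized weights can then be saved locally or pushed to the Hub for others to reuse.
quantized_model.save_pretrained("opt-350m-gptq")
tokenizer.save_pretrained("opt-350m-gptq")
```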
@@ -317,9 +317,9 @@ class GPTQConfig(QuantizationConfigMixin):
             original datasets used in GPTQ paper ['wikitext2','c4','c4-new','ptb','ptb-new']
         group_size (`int`, *optional*, defaults to 128):
             The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
-        damp_percent (`float`, *optional*, defaults to 0.01):
-            The percent of the average Hessian diagonal to use for dampening. Recommended value is 0.01.
-        desc_act (`bool`, *optional*, defaults to `True`):
+        damp_percent (`float`, *optional*, defaults to 0.1):
+            The percent of the average Hessian diagonal to use for dampening. Recommended value is 0.1.
+        desc_act (`bool`, *optional*, defaults to `False`):
             Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly
             speed up inference but the perplexity may become slightly worse. Also known as act-order.
         sym (`bool`, *optional*, defaults to `True`):
@@ -350,8 +350,8 @@ class GPTQConfig(QuantizationConfigMixin):
         tokenizer: Any = None,
         dataset: Optional[Union[List[str], str]] = None,
         group_size: int = 128,
-        damp_percent: float = 0.01,
-        desc_act: bool = True,
+        damp_percent: float = 0.1,
+        desc_act: bool = False,
         sym: bool = True,
         true_sequential: bool = True,
         use_cuda_fp16: bool = False,
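As a quick sanity check of the defaults changed in the two hunks above (a minimal sketch; attribute names follow the constructor signature shown):

```python
from transformers import GPTQConfig

# With no overrides, a fresh config picks up the new defaults from this commit.
config = GPTQConfig(bits=4, dataset="c4")
print(config.damp_percent)  # 0.1
print(config.desc_act)      # False

# Both can still be set explicitly, e.g. to re-enable act-order quantization.
config = GPTQConfig(bits=4, dataset="c4", desc_act=True, damp_percent=0.01)
```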
@@ -391,8 +391,8 @@ class GPTQConfig(QuantizationConfigMixin):
         r"""
         Safety checker that arguments are correct
         """
-        if self.bits not in [2, 4, 6, 8]:
-            raise ValueError(f"Only support quantization to [2,4,6,8] bits but found {self.bits}")
+        if self.bits not in [2, 3, 4, 8]:
+            raise ValueError(f"Only support quantization to [2,3,4,8] bits but found {self.bits}")
         if self.group_size != -1 and self.group_size <= 0:
             raise ValueError("group_size must be greater than 0 or equal to -1")
         if not (0 < self.damp_percent < 1):
......
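For completeness, a small sketch of how the corrected bit-width check behaves at config-creation time. The old check admitted 6-bit, which the `AutoGPTQ` backend does not provide, while rejecting the supported 3-bit setting; after this commit the checker matches the backend:

```python
from transformers import GPTQConfig

# 3-bit is now accepted by the safety checker.
GPTQConfig(bits=3, dataset="c4")

# 6-bit now fails fast when the config is built, instead of later in the backend.
try:
    GPTQConfig(bits=6, dataset="c4")
except ValueError as err:
    print(err)  # Only support quantization to [2,3,4,8] bits but found 6
```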