Commit da900c3b authored by yangql

Initial commit

<h1 align="center">AutoGPTQ</h1>
<p align="center">An easy-to-use LLM quantization package with user-friendly APIs, based on GPTQ algorithm (weight-only quantization).</p>
<p align="center">
<a href="https://github.com/PanQiWei/AutoGPTQ/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/PanQiWei/AutoGPTQ.svg">
</a>
<a href="https://pypi.org/project/auto-gptq/">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/auto-gptq">
</a>
</p>
<h4 align="center">
<p>
<b>English</b> |
<a href="https://github.com/PanQiWei/AutoGPTQ/blob/main/README_zh.md">中文</a>
</p>
</h4>
## News or Update
- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with [Marlin](https://github.com/IST-DASLab/marlin) int4*fp16 matrix multiplication kernel support, with the argument `use_marlin=True` when loading models.
- 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated `auto-gptq`, so running and training GPTQ models is now more accessible to everyone! See [this blog](https://huggingface.co/blog/gptq-integration) and its resources for more details!
*For more history, please see [here](docs/NEWS_OR_UPDATE.md)*
## Performance Comparison
### Inference Speed
> The results were generated with [this script](examples/benchmark/generation_speed.py). The input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (higher is better).
>
> The quantized model is loaded with the setup that gives the fastest inference speed.
| model | GPU | num_beams | fp16 | gptq-int4 |
|---------------|---------------|-----------|-------|-----------|
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 06.83 | 06.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |
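The table is easier to compare as ratios. The sketch below copies the tokens/s figures from the rows where both fp16 and gptq-int4 produced a number and divides one by the other. The `speedup` helper and the `results` layout are illustrative assumptions and are not part of the benchmark script.

```python
# Tokens/s figures copied from the table above as (fp16, gptq-int4).
# Rows where fp16 ran out of memory (OOM) are omitted: no ratio exists for them.
results = {
    ("llama-7b", "1xA100-40G", 1): (18.87, 25.53),
    ("llama-7b", "1xA100-40G", 4): (68.79, 91.30),
    ("moss-moon 16b", "1xA100-40G", 1): (12.48, 15.25),
    ("moss-moon 16b", "2xA100-40G", 1): (6.83, 6.78),
    ("moss-moon 16b", "2xA100-40G", 4): (13.10, 10.80),
}


def speedup(fp16_tps: float, int4_tps: float) -> float:
    """Return how many times faster int4 generation is than fp16."""
    return int4_tps / fp16_tps


speedups = {key: round(speedup(*tps), 2) for key, tps in results.items()}

for (model, gpu, beams), ratio in speedups.items():
    print(f"{model:14s} {gpu:11s} num_beams={beams}: {ratio:.2f}x")
```

A ratio below 1 (the two 2xA100 rows) means the quantized model was slower than fp16 in that setup.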
### Perplexity
For perplexity comparisons, see [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#result) and [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#gptq-vs-bitsandbytes).
## Installation
AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:
| Platform version | Installation | Built against PyTorch |
|-------------------|---------------------------------------------------------------------------------------------------|-----------------------|
| CUDA 11.8 | `pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.2.1+cu118 |
| CUDA 12.1 | `pip install auto-gptq --no-build-isolation` | 2.2.1+cu121 |
| ROCm 5.7 | `pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/` | 2.2.1+rocm5.7 |
| Intel® Gaudi® 2 AI accelerator | `BUILD_CUDA_EXT=0 pip install auto-gptq --no-build-isolation` | [2.3.1+Intel Gaudi 1.17](https://docs.habana.ai/en/latest/Installation_Guide/) |
To use the Triton backend, install AutoGPTQ with the Triton dependency: `pip install auto-gptq[triton] --no-build-isolation`. The Triton backend currently supports Linux only and does not support 3-bit quantization.
For older AutoGPTQ versions, please refer to [the previous releases installation table](docs/INSTALLATION.md).
On NVIDIA systems, AutoGPTQ does not support [Maxwell or lower](https://qiita.com/uyuni/items/733a93b975b524f89f46) GPUs.
### Install from source
Clone the source code:
```bash
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
```
A few packages are required in order to build from source: `pip install numpy gekko pandas`.
Then, install locally from source:
```bash
pip install -vvv --no-build-isolation -e .
```
You can set `BUILD_CUDA_EXT=0` to disable building the PyTorch extension, but this is **strongly discouraged** as AutoGPTQ then falls back on a slow Python implementation.
As a last resort, if the above command fails, you can try `python setup.py install`.
#### On ROCm systems
To install from source for AMD GPUs supporting ROCm, please specify the `ROCM_VERSION` environment variable. Example:
```bash
ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .
```
Compilation can be sped up by specifying the `PYTORCH_ROCM_ARCH` variable ([reference](https://github.com/pytorch/pytorch/blob/7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba/setup.py#L132)) to build for a single target device, for example `gfx90a` for MI200 series devices.
For ROCm systems, the packages `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev` are required to build.
#### On Intel Gaudi 2 systems
To install from source for Intel Gaudi 2 HPUs, set the `BUILD_CUDA_EXT=0` environment variable to disable building the CUDA PyTorch extension. Example:
```bash
BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .
```
> Note that Intel Gaudi 2 uses an optimized kernel for inference, and requires `BUILD_CUDA_EXT=0` on non-CUDA machines.
## Quick Tour
### Quantization and Inference
> Warning: this is only a showcase of the basic APIs in AutoGPTQ. It uses a single sample to quantize a very small model, so the quality of the quantized model may be poor.
Below is the simplest example of using `auto_gptq` to quantize a model and run inference after quantization:
```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)
# save quantized model
model.save_quantized(quantized_model_dir)
# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, log in first via huggingface-cli login,
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)
# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)
# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
For more advanced model quantization features, please refer to [this script](examples/quantization/quant_with_alpaca.py).
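The comments in the example say that `examples` should be a list of dicts whose keys are `input_ids` and `attention_mask`. A quick pre-flight check of that shape might look like the sketch below. `check_examples` is a hypothetical helper, not an `auto_gptq` function, and treating the two keys as exactly required is an assumption.

```python
from typing import List, Mapping, Sequence

# Keys named in the quantization example's comments.
EXPECTED_KEYS = {"input_ids", "attention_mask"}


def check_examples(examples: Sequence[Mapping[str, List[int]]]) -> None:
    """Raise ValueError if an example does not match the documented shape."""
    for i, example in enumerate(examples):
        keys = set(example.keys())
        if keys != EXPECTED_KEYS:
            raise ValueError(f"example {i}: expected keys {sorted(EXPECTED_KEYS)}, got {sorted(keys)}")
        if len(example["input_ids"]) != len(example["attention_mask"]):
            raise ValueError(f"example {i}: input_ids and attention_mask differ in length")


# Hand-written token ids stand in for real tokenizer output.
good = [{"input_ids": [2, 100, 200], "attention_mask": [1, 1, 1]}]
check_examples(good)  # passes silently
```

Running such a check before `model.quantize(examples)` can surface a malformed calibration set early, before any model weights are touched.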
### Customize Model
<details>
<summary>Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:</summary>
```python
from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order when they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]
```
After this, you can use `OPTGPTQForCausalLM.from_pretrained` and other methods as shown in the Quantization and Inference section above.
</details>
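The "chained attribute names" used in the class above read as dotted paths into the wrapped model. A minimal sketch of how such a path could be resolved follows. It assumes plain `getattr` traversal, and `resolve_chained_attr` is an illustrative name rather than the library's own helper.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_chained_attr(root, dotted_name: str):
    """Follow a dotted attribute path such as 'model.decoder.layers' from root."""
    return reduce(getattr, dotted_name.split("."), root)


# A toy object tree standing in for a real transformers model.
layers = ["layer0", "layer1"]
toy_model = SimpleNamespace(
    model=SimpleNamespace(decoder=SimpleNamespace(layers=layers, embed_tokens="embed"))
)

print(resolve_chained_attr(toy_model, "model.decoder.layers"))
print(resolve_chained_attr(toy_model, "model.decoder.embed_tokens"))
```

When adding a new architecture, resolving each name this way against the loaded model is a quick way to catch a misspelled path.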
### Evaluation on Downstream Tasks
You can use the tasks defined in `auto_gptq.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
The predefined tasks support all causal language models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.
<details>
<summary>Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>
```python
from functools import partial
import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask
MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])
    return new_samples
# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)
task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be drawn for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load the dataset; it must accept only data_name_or_path as input
        # and return a datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess the dataset, used with datasets.Dataset.map;
        # it must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate the label when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)
# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())
# self-consistency
print(
task.run(
generation_config=GenerationConfig(
num_beams=3,
num_return_sequences=3,
do_sample=True
)
)
)
```
</details>
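In this commit's `eval_tasks` code, `get_closest_label` maps each generated text to the class label with the smallest edit distance, and `get_predictions` takes a majority vote across the returned sequences (the "self-consistency" run above). The sketch below mirrors that idea in plain Python. The names `edit_distance`, `closest_label` and `vote` are illustrative and not part of the `auto_gptq` API, and the implementation here avoids NumPy for brevity.

```python
from collections import Counter
from typing import List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def closest_label(text: str, classes: List[str]) -> int:
    """Index of the class nearest to text (the first class wins on ties)."""
    distances = [edit_distance(text.lower().strip(), c) for c in classes]
    return distances.index(min(distances))


def vote(generations: List[str], classes: List[str]) -> str:
    """Majority vote over several generations for a single prompt."""
    votes = Counter(closest_label(g, classes) for g in generations)
    return classes[votes.most_common(1)[0][0]]


classes = ["negative", "neutral", "positive"]
print(vote([" Positive", "positiv", "neutral"], classes))
```

Matching by edit distance means a slightly malformed generation such as `positiv` still counts toward the intended class.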
## Learn More
[Tutorials](docs/tutorial) provide step-by-step guidance for integrating `auto_gptq` into your own project, along with some best practices.
[Examples](examples/README.md) provide plenty of example scripts that use `auto_gptq` in different ways.
## Supported Models
> You can compare `model.config.model_type` with the table below to check whether the model you use is supported by `auto_gptq`.
>
> For example, the model_type of `WizardLM`, `vicuna` and `gpt4all` is `llama`, so they are all supported by `auto_gptq`.
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|------------------------------------|--------------|-----------|-----------|---------------|-------------------------------------------------------------------------------------------------|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon(RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
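The `model_type` check described above can be sketched as a simple set lookup. The set below is copied from the first column of the table. `is_supported` is a hypothetical helper, and the toy `fake_model` object only stands in for a loaded transformers model.

```python
from types import SimpleNamespace

# Model types copied from the table above; the falcon row also lists
# the RefinedWebModel and RefinedWeb type names.
SUPPORTED_MODEL_TYPES = frozenset({
    "bloom", "gpt2", "gpt_neox", "gptj", "llama", "moss", "opt",
    "gpt_bigcode", "codegen", "falcon", "RefinedWebModel", "RefinedWeb",
})


def is_supported(model_type: str) -> bool:
    """Return True if model_type appears in the supported-models table."""
    return model_type in SUPPORTED_MODEL_TYPES


# Stand-in for a real model: only config.model_type matters for the check.
fake_model = SimpleNamespace(config=SimpleNamespace(model_type="llama"))
print(is_supported(fake_model.config.model_type))
```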
## Supported Evaluation Tasks
Currently, `auto_gptq` supports: `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more Tasks will come soon!
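For `LanguageModelingTask`, the metric code in this commit reports perplexity as the exponential of the mean per-batch loss (`math.exp(sum(pred) / len(pred))`). A small worked sketch of that arithmetic, using made-up loss values, is:

```python
import math
from typing import List


def perplexity(batch_losses: List[float]) -> float:
    """Perplexity as exp of the mean per-batch loss."""
    if not batch_losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(batch_losses) / len(batch_losses))


# Made-up loss values, for illustration only.
losses = [2.0, 2.5, 3.0]
print(round(perplexity(losses), 2))
```

Because the mean is taken over batches rather than tokens, batches of different lengths contribute equally to the result.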
## Running tests
Tests can be run with:
```
pytest tests/ -s
```
## FAQ
### Which kernel is used by default?
AutoGPTQ defaults to the exllamav2 int4*fp16 kernel for matrix multiplication.
### How to use Marlin kernel?
Marlin is an optimized int4 * fp16 kernel recently proposed at https://github.com/IST-DASLab/marlin. It is integrated in AutoGPTQ and is used when loading a model with `use_marlin=True`. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
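The compute-capability restriction can be checked before passing `use_marlin=True`. The sketch below is a hypothetical helper, not an `auto_gptq` function. It takes a `(major, minor)` tuple, which on a real machine could come from `torch.cuda.get_device_capability()`.

```python
from typing import Dict, Tuple

# Compute capabilities named in the FAQ answer above.
MARLIN_CAPABILITIES = {(8, 0), (8, 6)}


def marlin_supported(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability can use Marlin."""
    return tuple(capability) in MARLIN_CAPABILITIES


def loader_kwargs(capability: Tuple[int, int]) -> Dict[str, bool]:
    """Build the use_marlin keyword argument for a given device capability."""
    return {"use_marlin": marlin_supported(capability)}


print(loader_kwargs((8, 0)))  # an A100-class device
print(loader_kwargs((7, 5)))  # a Turing-class device
```

The resulting dict could then be unpacked into a loading call such as `AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", **loader_kwargs(cap))`.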
## Acknowledgement
- Special thanks to **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh** for proposing the **GPTQ** algorithm and open-sourcing the [code](https://github.com/IST-DASLab/gptq), and for releasing the [Marlin kernel](https://github.com/IST-DASLab/marlin) for mixed-precision computation.
- Special thanks to **qwopqwop200**: the quantization-related code in this project is mainly based on [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
- Special thanks to **turboderp** for releasing the [Exllama](https://github.com/turboderp/exllama) and [Exllama v2](https://github.com/turboderp/exllamav2) libraries with efficient mixed-precision kernels.
<h1 align="center">AutoGPTQ</h1>
<p align="center">An easy-to-use large language model quantization toolkit with user-friendly interfaces, based on the GPTQ algorithm.</p>
<p align="center">
<a href="https://github.com/PanQiWei/AutoGPTQ/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/PanQiWei/AutoGPTQ.svg">
</a>
<a href="https://pypi.org/project/auto-gptq/">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/auto-gptq">
</a>
</p>
<h4 align="center">
<p>
<a href="https://github.com/PanQiWei/AutoGPTQ/blob/main/README.md">English</a> |
<b>中文</b>
</p>
</h4>
Note: The English README is likely to be more up to date.
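## The Road to v1.0.0
Hi, friends in the community, long time no see! I'm sorry that, for personal reasons, I have not been able to update this project frequently during this period. The past few weeks have been very significant for my career planning. Not long ago, I officially said goodbye to the startup team I had joined right after graduation and stayed with for two years. I am very grateful to the team's leaders and colleagues for their trust and guidance, which let me grow rapidly over those two years. I am also very grateful that the team allowed me to use its internal A100 GPU server cluster free of charge, ever since the AutoGPTQ project was founded, to complete various experiments and performance evaluations. (Of course, I can no longer use it from now on, so **I would be very grateful for any new hardware sponsorship**!) Over the past two years, I worked on this team as an algorithm engineer, responsible for the architecture design and development of dialogue systems based on large language models. We once successfully launched a product named gemsouls, but unfortunately it has ceased operation. Now the team is about to launch a new product named [modelize](https://www.beta.modelize.ai/), **an LLM-native AI agent platform on which users can build a highly automated team out of multiple AI agents that cooperate with each other in a workflow and complete complex projects efficiently.**
Back to the topic: I am very excited to see that, over the past few months, research on optimizing the inference performance of large language models has made tremendous progress. Today we can run LLM inference not only on high-end GPUs but also, with ease, on CPUs and edge devices. This series of technical advances makes me equally eager to contribute more to the open-source community. So, first, I will spend about four weeks iterating AutoGPTQ to the official v1.0.0 release; during this period there will also be 2~3 minor releases so that users can try out performance optimizations and new features in a timely manner. In my vision, **by the time v1.0.0 is officially released, AutoGPTQ will be able to serve as a flexible and extensible quantization backend that supports all GPTQ-like methods and automatically quantizes large language models written in PyTorch**. I have described the development plan in detail [here](https://github.com/PanQiWei/AutoGPTQ/issues/348); you are welcome to join the discussion there and share your suggestions!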
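## News or Update
- 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated `auto-gptq`, so inference and training with GPTQ models is now easier! Read [this blog](https://huggingface.co/blog/gptq-integration) and the related resources for more details!
- 2023-08-21 - (News) - The Qwen team released a 4-bit quantized version of Qwen-7B based on `auto-gptq`, and provided [detailed evaluation results](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4#%E9%87%8F%E5%8C%96-quantization).
- 2023-08-06 - (Update) - Support for exllama's q4 CUDA kernel, which gives int4 quantized models at least a 1.3x inference speedup.
- 2023-08-04 - (Update) - Support for ROCm, so that AMD GPU users can use auto-gptq's CUDA extensions.
- 2023-07-26 - (Update) - An elegant [PPL benchmark script](examples/benchmark/perplexity.py) that produces results that can be fairly compared with other libraries such as `llama.cpp`.
- 2023-06-05 - (Update) - Integrated 🤗 peft to train adapters on GPTQ-quantized models, with support for LoRA, AdaLoRA, AdaptionPrompt, etc.
- 2023-05-30 - (Update) - Support for downloading quantized models from the 🤗 Hub and uploading quantized models to the 🤗 Hub.
*For more history, please see [here](docs/NEWS_OR_UPDATE.md)*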
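## Performance Comparison
### Inference Speed
> The results below were generated with [this script](examples/benchmark/generation_speed.py). The input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (higher is better).
>
> The quantized model is loaded with the setup that maximizes inference speed.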
| model | GPU | num_beams | fp16 | gptq-int4 |
|---------------|---------------|-----------|-------|-----------|
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 06.83 | 06.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |
### Perplexity (PPL)
For perplexity comparisons, see [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#result) and [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#gptq-vs-bitsandbytes).
## Installation
### Quick Installation
You can install the latest stable release of AutoGPTQ from pip, with pre-built wheels compatible with PyTorch 2.0.1:
* For CUDA 11.7: `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/`
* For CUDA 11.8: `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/`
* For ROCm 5.4.2: `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm542/`
**Warning:** The pre-built wheels are not guaranteed to work with PyTorch nightly builds. To use PyTorch nightly, install AutoGPTQ from source.
#### Disable CUDA extensions
By default, the CUDA extensions are installed automatically when `torch` and `cuda` are already installed on your machine. If you do not want these extensions, use the following install command:
```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```
To also make sure the `autogptq_cuda` extension no longer exists in your virtual environment, run:
```shell
pip uninstall autogptq_cuda -y
```
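#### Use Triton for acceleration
To use `triton` to speed up model inference, use the following command:
> Warning: Triton currently supports Linux only, and 3-bit quantization is not supported when using Triton.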
```shell
pip install auto-gptq[triton]
```
### Install from source
<details>
<summary>Click to see details</summary>
Clone the source code:
```shell
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
```
Then, install from the project directory:
```shell
pip install .
```
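As in the quick installation section, you can use `BUILD_CUDA_EXT=0` to disable building the CUDA extension.
If you want to use Triton and it is supported by your operating system, use `.[triton]`.
For AMD GPUs, to install from source with ROCm support, set the `ROCM_VERSION` environment variable. Compilation can also be sped up by setting `PYTORCH_ROCM_ARCH` ([reference](https://github.com/pytorch/pytorch/blob/7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba/setup.py#L132)); for example, for MI200 series devices the variable can be set to `gfx90a`. Example: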
```
ROCM_VERSION=5.6 pip install .
```
For ROCm systems, the following packages must be installed beforehand when installing from source: `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev`.
</details>
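## Quick Tour
### Quantization and Inference
> Warning: this is only a showcase of the basic interfaces in AutoGPTQ. It uses a single text sample to quantize a very small model, so the results may not be as good as what you would expect from quantizing a large model.
Below is the simplest example of using `auto_gptq` for quantization and inference: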
```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # it is generally recommended to set this value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but ppl may be slightly worse
)
# load the un-quantized model; by default, the model is always loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
# quantize the model; examples should be a List[Dict] whose keys are exactly input_ids and attention_mask
model.quantize(examples)
# save the quantized model
model.save_quantized(quantized_model_dir)
# save the quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
# push the quantized model directly to the Hugging Face Hub
# when using use_auth_token=True, make sure you have logged in first via huggingface-cli login,
# or pass an explicit account token with use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
# alternatively, you can save the quantized model locally and push it to the Hugging Face Hub at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)
# load the quantized model onto the first visible GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
# download a quantized model from the Hugging Face Hub and load it onto the first visible GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)
# run inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
# or use TextGenerationPipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
See [this example script](examples/quantization/quant_with_alpaca.py) for more advanced usage.
### Customize Model
<details>
<summary>Below is an example of extending `auto_gptq` to support the `OPT` model; as you can see, it is very easy:</summary>
```python
from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order when they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]
```
After this, you can use `OPTGPTQForCausalLM.from_pretrained` and other methods as shown in the basic usage section.
</details>
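### Evaluation on Downstream Tasks
You can use the tasks defined in `auto_gptq.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
The predefined tasks support all causal language models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.
<details>
<summary>Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification (text classification) task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>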
```python
from functools import partial
import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask
MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])
    return new_samples
# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)
task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be drawn for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load the dataset; it must accept only data_name_or_path as input
        # and return a datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess the dataset, used with datasets.Dataset.map;
        # it must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate the label when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)
# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())
# self-consistency
print(
task.run(
generation_config=GenerationConfig(
num_beams=3,
num_return_sequences=3,
do_sample=True
)
)
)
```
</details>
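## Learn More
[Tutorials](docs/tutorial) provide step-by-step guidance for integrating `auto_gptq` into your own project, along with best practices.
[Examples](examples/README.md) provide plenty of example scripts that use `auto_gptq` in different scenarios.
## Supported Models
> You can compare `model.config.model_type` with the table below to check whether the model you are using is supported by `auto_gptq`.
>
> For example, the `model_type` of `WizardLM`, `vicuna` and `gpt4all` is `llama`, so these models are all supported by `auto_gptq`.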
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|------------------------------------|--------------|-----------|-----------|---------------|-----------------------------------------------------------------------------------|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon(RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
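## Supported Evaluation Tasks
Currently, `auto_gptq` supports the following evaluation tasks: `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more evaluation tasks are coming soon!
## Acknowledgement
- Special thanks to **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh** for proposing the **GPTQ** algorithm and open-sourcing the [code](https://github.com/IST-DASLab/gptq).
- Special thanks to **qwopqwop200**: the model quantization code in this project is mainly based on [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).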
[![Star History Chart](https://api.star-history.com/svg?repos=PanQiwei/AutoGPTQ&type=Date)](https://star-history.com/#PanQiWei/AutoGPTQ&Date)
from .modeling import AutoGPTQForCausalLM, BaseQuantizeConfig
from .utils.exllama_utils import exllama_set_max_input_length
from .utils.peft_utils import get_gptq_peft_model
__version__ = "0.8.0.dev0"
from .language_modeling_task import LanguageModelingTask
from .sequence_classification_task import SequenceClassificationTask, get_predictions
from .text_summarization_task import TextSummarizationTask
from abc import abstractmethod
from typing import Any, Dict, List, Optional, Union
import torch
from transformers import PreTrainedModel, PreTrainedTokenizer
from ..modeling import BaseGPTQForCausalLM
from ..utils.data_utils import get_dataloader
class BaseTask:
    def __init__(
        self,
        model: Union[BaseGPTQForCausalLM, PreTrainedModel],
        tokenizer: PreTrainedTokenizer,
        data_name_or_path: str,
        prompt_col_name: str,
        label_col_name: str,
        device: Optional[str] = None,
        **kwargs,
    ):
        self.model = model
        self.tokenizer = tokenizer
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
            self.model.config.pad_token_id = self.tokenizer.eos_token_id
        self.dl = get_dataloader(
            data_name_or_path,
            prompt_col_name=prompt_col_name,
            label_col_name=label_col_name,
            tokenizer=tokenizer,
            **kwargs,
        )
        self.device = device
        if not self.device:
            self.device = self.model.device
        if isinstance(self.device, str):
            self.device = torch.device(self.device)

    @abstractmethod
    def _predict(self, batch_data: Dict[str, Any], **kwargs) -> List[Any]:
        pass

    @abstractmethod
    def _parse_labels(self, label_ids: torch.LongTensor) -> List[Any]:
        pass

    @abstractmethod
    def _metric(self, pred: List[Any], label: List[Any]) -> Dict[str, float]:
        pass

    def run(self, **predict_kwargs) -> Dict[str, float]:
        with torch.inference_mode(), torch.amp.autocast(device_type=self.device.type):
            predictions = []
            labels = []
            for batch_data in self.dl:
                for k, v in batch_data.items():
                    if isinstance(v, torch.Tensor):
                        batch_data[k] = v.to(self.device)
                labels += self._parse_labels(batch_data["labels"])
                predictions += self._predict(batch_data, **predict_kwargs)
        return self._metric(predictions, labels)
import sys
from typing import List, Sequence
import numpy as np
def levenshtein_distance(seq1: Sequence, seq2: Sequence):
    if seq1 == seq2:
        return 0
    num_rows = len(seq1) + 1
    num_cols = len(seq2) + 1
    dp_matrix = np.empty((num_rows, num_cols))
    dp_matrix[0, :] = range(num_cols)
    dp_matrix[:, 0] = range(num_rows)
    for i in range(1, num_rows):
        for j in range(1, num_cols):
            if seq1[i - 1] == seq2[j - 1]:
                dp_matrix[i, j] = dp_matrix[i - 1, j - 1]
            else:
                dp_matrix[i, j] = (
                    min(
                        dp_matrix[i - 1, j - 1],
                        dp_matrix[i - 1, j],
                        dp_matrix[i, j - 1],
                    )
                    + 1
                )
    return dp_matrix[num_rows - 1, num_cols - 1]


def get_closest_label(pred: Sequence, classes: List[Sequence]) -> int:
    min_id = sys.maxsize
    min_edit_distance = sys.maxsize
    for i, class_label in enumerate(classes):
        edit_distance = levenshtein_distance(pred, class_label)
        if edit_distance < min_edit_distance:
            min_id = i
            min_edit_distance = edit_distance
    return min_id
__all__ = ["levenshtein_distance", "get_closest_label"]
from typing import List, Optional, Union
from torch import LongTensor
from transformers import PreTrainedTokenizer
def postprocess_generation_ids(
    input_ids: LongTensor,
    output_ids: LongTensor,
    num_return_sequences: int,
    tokenizer: Optional[PreTrainedTokenizer] = None,
    pad_token_ids: Optional[int] = None,
) -> List[List[Union[str, List[int]]]]:
    outputs = []
    for idx, start in enumerate(range(0, len(output_ids), num_return_sequences)):
        sub_output_ids = output_ids[start : start + num_return_sequences]
        sub_generated_ids = sub_output_ids[..., input_ids[idx].size(0) :]
        if tokenizer:
            decoded_bach = (
                generated_text
                for generated_text in tokenizer.batch_decode(sub_generated_ids, clean_up_tokenization_spaces=True)
            )
            decoded_bach = list(decoded_bach)
            outputs.append(decoded_bach)
        else:
            sub_generated_ids = sub_output_ids.cpu().numpy().tolist()
            for i, one_sub_generated_ids in enumerate(sub_generated_ids):
                if pad_token_ids is not None and pad_token_ids in one_sub_generated_ids:
                    one_sub_generated_ids = one_sub_generated_ids[: one_sub_generated_ids.index(pad_token_ids)]
                sub_generated_ids[i] = one_sub_generated_ids
            outputs.append(sub_generated_ids)
    return outputs
__all__ = ["postprocess_generation_ids"]
import math
from typing import Any, Dict, List, Optional

from torch import LongTensor

from ._base import BaseTask


class LanguageModelingTask(BaseTask):
    def __init__(
        self,
        model,
        tokenizer,
        data_name_or_path: str,
        prompt_col_name: str,
        label_col_name: str,
        device: Optional[str] = None,
        **kwargs,
    ):
        kwargs["merge_prompt_label"] = True
        super().__init__(
            model=model,
            tokenizer=tokenizer,
            data_name_or_path=data_name_or_path,
            prompt_col_name=prompt_col_name,
            label_col_name=label_col_name,
            device=device,
            **kwargs,
        )

    def _predict(self, batch_data: Dict[str, Any], *args, **kwargs) -> List[float]:
        outputs = self.model(**batch_data)
        loss = outputs.loss.cpu().item()
        return [loss]

    def _parse_labels(self, label_ids: LongTensor) -> List[Any]:
        return []

    def _metric(self, pred: List[Any], label: List[Any]) -> Dict[str, float]:
        return {"ppl": math.exp(sum(pred) / len(pred))}

    def run(self) -> Dict[str, float]:
        return super().run()


__all__ = ["LanguageModelingTask"]
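
`LanguageModelingTask._metric` reports perplexity as the exponential of the mean per-batch loss. The standalone sketch below reproduces that arithmetic; `perplexity` is a hypothetical helper name, and the loss values are made up purely to check the formula.

```python
import math
from typing import List


def perplexity(batch_losses: List[float]) -> float:
    """exp(mean loss), with every batch weighted equally."""
    return math.exp(sum(batch_losses) / len(batch_losses))


# A constant loss of ln(4) corresponds to a perplexity of 4.
print(perplexity([math.log(4.0)] * 3))
```

Because the mean is taken over batches rather than tokens, batches with different token counts contribute equally, so the figure can differ slightly from a token-weighted perplexity.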
from collections import Counter
from typing import Any, Dict, List, Optional

import numpy as np
from torch import LongTensor
from transformers import GenerationConfig, PreTrainedTokenizer

from ._base import BaseTask
from ._utils.classification_utils import get_closest_label
from ._utils.generation_utils import postprocess_generation_ids


def get_predictions(
    input_ids: LongTensor,
    output_ids: LongTensor,
    num_return_sequences: int,
    tokenizer: PreTrainedTokenizer,
    classes: List[str],
) -> List[int]:
    predictions = []
    generated_texts = postprocess_generation_ids(
        input_ids=input_ids,
        output_ids=output_ids,
        num_return_sequences=num_return_sequences,
        tokenizer=tokenizer,
    )
    for sub_generated_texts in generated_texts:
        sub_predictions = []
        for gen_text in sub_generated_texts:
            sub_predictions.append(get_closest_label(gen_text.lower().strip(), classes))
        predictions.append(Counter(sub_predictions).most_common(1)[0][0])
    return predictions


class SequenceClassificationTask(BaseTask):
    def __init__(
        self,
        model,
        tokenizer: PreTrainedTokenizer,
        classes: List[str],
        data_name_or_path: str,
        prompt_col_name: str,
        label_col_name: str,
        device: Optional[str] = None,
        **kwargs,
    ):
        kwargs["merge_prompt_label"] = False
        super().__init__(
            model=model,
            tokenizer=tokenizer,
            data_name_or_path=data_name_or_path,
            prompt_col_name=prompt_col_name,
            label_col_name=label_col_name,
            device=device,
            **kwargs,
        )
        self.classes = [each.lower().strip() for each in classes]
        # Iterating the tokenizer's BatchEncoding yields its keys, not the
        # sequences, so index "input_ids" to measure each label's token length.
        classes_ids = self.tokenizer(classes)["input_ids"]
        self.max_new_tokens = max(len(each) for each in classes_ids)

    def _predict(self, batch_data: Dict[str, Any], *args, **kwargs) -> List[int]:
        generation_config = kwargs["generation_config"]
        output_ids = self.model.generate(
            input_ids=batch_data["input_ids"],
            attention_mask=batch_data["attention_mask"],
            generation_config=generation_config,
        )
        return get_predictions(
            batch_data["input_ids"],
            output_ids,
            generation_config.num_return_sequences,
            self.tokenizer,
            self.classes,
        )

    def _parse_labels(self, label_ids: LongTensor) -> List[int]:
        labels = []
        for one_label_ids in label_ids:
            one_label_ids = one_label_ids[(one_label_ids == -100).sum() :]
            label = self.tokenizer.decode(one_label_ids, clean_up_tokenization_spaces=True).lower().strip()
            label = get_closest_label(label, self.classes)
            labels.append(label)
        return labels

    def _metric(self, pred: List[int], label: List[int]) -> Dict[str, float]:
        pred = np.array(pred)
        label = np.array(label)
        acc = (pred == label).mean()
        return {"acc": acc}

    def run(self, generation_config: Optional[GenerationConfig] = None) -> Dict[str, float]:
        if not generation_config:
            generation_config = GenerationConfig(num_beams=1, do_sample=False, num_return_sequences=1)
        generation_config.max_new_tokens = self.max_new_tokens
        generation_config.eos_token_id = self.tokenizer.eos_token_id
        generation_config.pad_token_id = self.tokenizer.pad_token_id
        return super().run(generation_config=generation_config)


__all__ = ["SequenceClassificationTask"]
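
When `num_return_sequences` is greater than 1, `get_predictions` reduces the label indices predicted for one prompt to a single answer by majority vote. A small sketch of that reduction follows; `majority_vote` is a hypothetical name used only for illustration.

```python
from collections import Counter
from typing import List


def majority_vote(label_indices: List[int]) -> int:
    """Most frequent label index among the sequences generated for one prompt."""
    return Counter(label_indices).most_common(1)[0][0]


# Three sampled generations for the same prompt mapped to labels 2, 1 and 1.
print(majority_vote([2, 1, 1]))  # 1
```

With the default greedy configuration (`num_return_sequences=1`) the vote is over a single element and simply returns it.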
from typing import Any, Dict, List, Optional

import rouge
from torch import LongTensor
from transformers import GenerationConfig

from ._base import BaseTask
from ._utils.generation_utils import postprocess_generation_ids


class TextSummarizationTask(BaseTask):
    def __init__(
        self,
        model,
        tokenizer,
        data_name_or_path: str,
        prompt_col_name: str,
        label_col_name: str,
        device: Optional[str] = None,
        **kwargs,
    ):
        kwargs["merge_prompt_label"] = False
        super().__init__(
            model=model,
            tokenizer=tokenizer,
            data_name_or_path=data_name_or_path,
            prompt_col_name=prompt_col_name,
            label_col_name=label_col_name,
            device=device,
            **kwargs,
        )

    def _predict(self, batch_data: Dict[str, Any], *args, **kwargs) -> List[str]:
        generation_config = kwargs["generation_config"]
        output_ids = self.model.generate(
            input_ids=batch_data["input_ids"],
            attention_mask=batch_data["attention_mask"],
            generation_config=generation_config,
        )
        return [
            each[0].lower().strip()
            for each in postprocess_generation_ids(
                input_ids=batch_data["input_ids"],
                output_ids=output_ids,
                num_return_sequences=generation_config.num_return_sequences,
                tokenizer=self.tokenizer,
            )
        ]

    def _parse_labels(self, label_ids: LongTensor) -> List[str]:
        labels = []
        for one_label_ids in label_ids:
            one_label_ids = one_label_ids[(one_label_ids == -100).sum() :]
            label = self.tokenizer.decode(one_label_ids).lower().strip()
            labels.append(label)
        return labels

    def _metric(self, pred: List[Any], label: List[Any]) -> Dict[str, Dict[str, float]]:
        metric = rouge.Rouge()
        return metric.get_scores(hyps=pred, refs=label, avg=True)

    def run(self, generation_config: Optional[GenerationConfig] = None) -> Dict[str, float]:
        if not generation_config:
            generation_config = GenerationConfig(num_beams=1, do_sample=False, max_new_tokens=128)
        generation_config.num_return_sequences = 1
        generation_config.eos_token_id = self.tokenizer.eos_token_id
        generation_config.pad_token_id = self.tokenizer.pad_token_id
        return super().run(generation_config=generation_config)


__all__ = ["TextSummarizationTask"]
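
Both `_parse_labels` methods recover the real label tokens by counting the `-100` entries and slicing that many positions off the front. The sketch below shows the same trick on plain Python lists; `strip_ignored_prefix` and `IGNORE_INDEX` are hypothetical names, and the token ids are invented.

```python
from typing import List

IGNORE_INDEX = -100  # value used to mask positions excluded from the labels


def strip_ignored_prefix(label_ids: List[int]) -> List[int]:
    """Drop as many leading positions as there are ignored entries."""
    num_ignored = sum(1 for token_id in label_ids if token_id == IGNORE_INDEX)
    return label_ids[num_ignored:]


print(strip_ignored_prefix([-100, -100, 5, 6]))  # [5, 6]
```

The slice is only correct when every masked position sits at the start of the row. If masked entries appeared after the label tokens, it would discard real tokens instead.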
from ._base import BaseGPTQForCausalLM, BaseQuantizeConfig
from .auto import GPTQ_CAUSAL_LM_MODEL_MAP, AutoGPTQForCausalLM
from .baichuan import BaiChuanGPTQForCausalLM
from .bloom import BloomGPTQForCausalLM
from .codegen import CodeGenGPTQForCausalLM
from .cohere import CohereGPTQForCausalLM
from .decilm import DeciLMGPTQForCausalLM
from .gemma import GemmaGPTQForCausalLM
from .gemma2 import Gemma2GPTQForCausalLM
from .gpt2 import GPT2GPTQForCausalLM
from .gpt_bigcode import GPTBigCodeGPTQForCausalLM
from .gpt_neox import GPTNeoXGPTQForCausalLM
from .gptj import GPTJGPTQForCausalLM
from .internlm import InternLMGPTQForCausalLM
from .llama import LlamaGPTQForCausalLM
from .longllama import LongLlamaGPTQForCausalLM
from .mistral import MistralGPTQForCausalLM
from .mixtral import MixtralGPTQForCausalLM
from .moss import MOSSGPTQForCausalLM
from .mpt import MPTGPTQForCausalLM
from .opt import OPTGPTQForCausalLM
from .phi import PhiGPTQForCausalLM
from .qwen import QwenGPTQForCausalLM
from .qwen2 import Qwen2GPTQForCausalLM
from .rw import RWGPTQForCausalLM
from .stablelmepoch import StableLMEpochGPTQForCausalLM
from .starcoder2 import Starcoder2GPTQForCausalLM
from .xverse import XverseGPTQForCausalLM
from .yi import YiGPTQForCausalLM