Initial commit

Commit da900c3b authored Sep 19, 2024 by yangql
20 changed files
--- a/README.md
+++ b/README.md
+<h1 align="center">AutoGPTQ</h1>
+<p align="center">An easy-to-use LLM quantization package with user-friendly APIs, based on GPTQ algorithm (weight-only quantization).</p>
+<p align="center">
+    <a href="https://github.com/PanQiWei/AutoGPTQ/releases">
+        <img alt="GitHub release" src="https://img.shields.io/github/release/PanQiWei/AutoGPTQ.svg">
+    </a>
+    <a href="https://pypi.org/project/auto-gptq/">
+        <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/auto-gptq">
+    </a>
+</p>
+<h4 align="center">
+    <p>
+        <b>English</b> |
+        <a href="https://github.com/PanQiWei/AutoGPTQ/blob/main/README_zh.md">中文</a>
+    </p>
+</h4>
+## News or Update
+- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with support for the [Marlin](https://github.com/IST-DASLab/marlin) int4*fp16 matrix multiplication kernel, enabled by passing `use_marlin=True` when loading models.
+- 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated `auto-gptq`, so running and training GPTQ models is now more accessible to everyone! See [this blog](https://huggingface.co/blog/gptq-integration) and its resources for more details!
+*For the full history, please see [here](docs/NEWS_OR_UPDATE.md)*
+## Performance Comparison
+### Inference Speed
+> The results are generated using [this script](examples/benchmark/generation_speed.py): the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).
+>
+> The quantized model is loaded with the configuration that gives the fastest inference speed.
+| model         | GPU           | num_beams | fp16  | gptq-int4 |
+|---------------|---------------|-----------|-------|-----------|
+| llama-7b      | 1xA100-40G    | 1         | 18.87 | 25.53     |
+| llama-7b      | 1xA100-40G    | 4         | 68.79 | 91.30     |
+| moss-moon 16b | 1xA100-40G    | 1         | 12.48 | 15.25     |
+| moss-moon 16b | 1xA100-40G    | 4         | OOM   | 42.67     |
+| moss-moon 16b | 2xA100-40G    | 1         | 06.83 | 06.78     |
+| moss-moon 16b | 2xA100-40G    | 4         | 13.10 | 10.80     |
+| gpt-j 6b      | 1xRTX3060-12G | 1         | OOM   | 29.55     |
+| gpt-j 6b      | 1xRTX3060-12G | 4         | OOM   | 47.36     |
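The relative speedup of int4 over fp16 is easier to read as a ratio. The sketch below only re-computes it from the numbers in the table above; the `results` dictionary is a hand transcription for illustration and is not produced by the benchmark script.

```python
# Throughput in tokens/s, transcribed from the table above; None marks an OOM run.
results = {
    ("llama-7b", "1xA100-40G", 1): (18.87, 25.53),
    ("llama-7b", "1xA100-40G", 4): (68.79, 91.30),
    ("moss-moon 16b", "1xA100-40G", 1): (12.48, 15.25),
    ("moss-moon 16b", "1xA100-40G", 4): (None, 42.67),
    ("moss-moon 16b", "2xA100-40G", 1): (6.83, 6.78),
    ("moss-moon 16b", "2xA100-40G", 4): (13.10, 10.80),
    ("gpt-j 6b", "1xRTX3060-12G", 1): (None, 29.55),
    ("gpt-j 6b", "1xRTX3060-12G", 4): (None, 47.36),
}

# int4 / fp16 throughput ratio for every row that has an fp16 baseline.
speedups = {
    key: int4 / fp16
    for key, (fp16, int4) in results.items()
    if fp16 is not None  # no ratio when the fp16 baseline ran out of memory
}

for (model, gpu, beams), ratio in speedups.items():
    print(f"{model:14s} {gpu:14s} num_beams={beams}: {ratio:.2f}x")
```

A ratio below 1 means the int4 model was slower than fp16 for that row, which in this table happens only in the 2xA100 runs.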
+### Perplexity
+For a perplexity comparison, see [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#result) and [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#gptq-vs-bitsandbytes).
+## Installation
+AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:
+| Platform version | Installation                                                                                      | Built against PyTorch |
+|-------------------|---------------------------------------------------------------------------------------------------|-----------------------|
+| CUDA 11.8         | `pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/`   | 2.2.1+cu118           |
+| CUDA 12.1         | `pip install auto-gptq --no-build-isolation`                                                                            | 2.2.1+cu121           |
+| ROCm 5.7          | `pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/` | 2.2.1+rocm5.7         |
+| Intel® Gaudi® 2 AI accelerator | `BUILD_CUDA_EXT=0 pip install auto-gptq --no-build-isolation` | [2.3.1+Intel Gaudi 1.17](https://docs.habana.ai/en/latest/Installation_Guide/) |
+AutoGPTQ can be installed with the Triton dependency using `pip install auto-gptq[triton] --no-build-isolation` in order to use the Triton backend (currently Linux only; 3-bit quantization is not supported).
+For older AutoGPTQ, please refer to [the previous releases installation table](docs/INSTALLATION.md).
+On NVIDIA systems, AutoGPTQ does not support [Maxwell or lower](https://qiita.com/uyuni/items/733a93b975b524f89f46) GPUs.
+### Install from source
+Clone the source code:
+```bash
+git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
+```
+A few packages are required in order to build from source: `pip install numpy gekko pandas`.
+Then, install locally from source:
+```bash
+pip install -vvv --no-build-isolation -e .
+```
+You can set `BUILD_CUDA_EXT=0` to disable building the PyTorch extension, but this is **strongly discouraged** as AutoGPTQ then falls back on a slow Python implementation.
+As a last resort, if the above command fails, you can try `python setup.py install`.
+#### On ROCm systems
+To install from source for AMD GPUs supporting ROCm, please specify the `ROCM_VERSION` environment variable. Example:
+```bash
+ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .
+```
+The compilation can be sped up by specifying the `PYTORCH_ROCM_ARCH` variable ([reference](https://github.com/pytorch/pytorch/blob/7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba/setup.py#L132)) in order to build for a single target device, for example `gfx90a` for MI200 series devices.
+For ROCm systems, the packages `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev` are required to build.
+#### On Intel Gaudi 2 systems
+To install from source for Intel Gaudi 2 HPUs, set the `BUILD_CUDA_EXT=0` environment variable to disable building the CUDA PyTorch extension. Example:
+```bash
+BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .
+```
+>Note that Intel Gaudi 2 uses an optimized kernel at inference time, and that `BUILD_CUDA_EXT=0` is required on non-CUDA machines.
+## Quick Tour
+### Quantization and Inference
+> warning: this is just a showcase of the basic APIs in AutoGPTQ. It uses only one sample to quantize a very small model, so the quality of the quantized model may be poor.
+Below is the simplest example of using `auto_gptq` to quantize a model and run inference after quantization:
+```python
+from transformers import AutoTokenizer, TextGenerationPipeline
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+import logging
+logging.basicConfig(
+    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
+)
+pretrained_model_dir = "facebook/opt-125m"
+quantized_model_dir = "opt-125m-4bit"
+tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
+examples = [
+    tokenizer(
+        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
+    )
+]
+quantize_config = BaseQuantizeConfig(
+    bits=4,  # quantize model to 4-bit
+    group_size=128,  # it is recommended to set the value to 128
+    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
+)
+# load un-quantized model, by default, the model will always be loaded into CPU memory
+model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
+# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
+model.quantize(examples)
+# save quantized model
+model.save_quantized(quantized_model_dir)
+# save quantized model using safetensors
+model.save_quantized(quantized_model_dir, use_safetensors=True)
+# push quantized model to Hugging Face Hub.
+# to use use_auth_token=True, log in first via huggingface-cli login.
+# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
+# (uncomment the following three lines to enable this feature)
+# repo_id = f"YourUserName/{quantized_model_dir}"
+# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
+# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
+# alternatively you can save and push at the same time
+# (uncomment the following three lines to enable this feature)
+# repo_id = f"YourUserName/{quantized_model_dir}"
+# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
+# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)
+# load quantized model to the first GPU
+model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
+# download quantized model from Hugging Face Hub and load to the first GPU
+# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)
+# inference with model.generate
+print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
+# or you can also use pipeline
+pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
+print(pipeline("auto-gptq is")[0]["generated_text"])
+```
+For more advanced model quantization features, see [this script](examples/quantization/quant_with_alpaca.py).
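To make the `bits` and `group_size` settings above concrete, here is a minimal sketch of group-wise weight quantization. It uses plain round-to-nearest, not the GPTQ algorithm itself (GPTQ additionally compensates for the rounding error while it quantizes), and `quantize_groupwise` / `dequantize_groupwise` are hypothetical helper names, not part of the `auto_gptq` API.

```python
from typing import List, Tuple


def quantize_groupwise(
    weights: List[float], bits: int = 4, group_size: int = 4
) -> Tuple[List[int], List[Tuple[float, float]]]:
    """Round-to-nearest quantization with one (scale, minimum) pair per group."""
    levels = (1 << bits) - 1  # 15 integer steps for 4-bit codes
    codes: List[int] = []
    params: List[Tuple[float, float]] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        params.append((scale, lo))
        for w in group:
            q = round((w - lo) / scale)
            codes.append(max(0, min(levels, q)))  # clamp into [0, levels]
    return codes, params


def dequantize_groupwise(
    codes: List[int], params: List[Tuple[float, float]], group_size: int = 4
) -> List[float]:
    """Map integer codes back to floats using each group's scale and minimum."""
    restored = []
    for i, code in enumerate(codes):
        scale, lo = params[i // group_size]
        restored.append(code * scale + lo)
    return restored


# Two groups with very different ranges: each gets its own scale.
weights = [0.10, -0.20, 0.05, 0.30, 2.00, -1.50, 0.75, 0.00]
codes, params = quantize_groupwise(weights, bits=4, group_size=4)
restored = dequantize_groupwise(codes, params, group_size=4)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error)
```

A smaller group size means more scale/minimum pairs to store but a tighter range per group, which is the trade-off behind the recommended `group_size=128`.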
+### Customize Model
+<details>
+<summary>Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:</summary>
+```python
+from auto_gptq.modeling import BaseGPTQForCausalLM
+class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
+    # chained attribute name of transformer layer block
+    layers_block_name = "model.decoder.layers"
+    # chained attribute names of other nn modules that are at the same level as the transformer layer block
+    outside_layer_modules = [
+        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
+        "model.decoder.project_in", "model.decoder.final_layer_norm"
+    ]
+    # chained attribute names of linear layers in transformer layer module
+    # normally there are four sub-lists; the modules in each sub-list can be seen as one operation,
+    # and the sub-lists should be ordered as they are actually executed; in this case (and in most cases),
+    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
+    inside_layer_modules = [
+        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
+        ["self_attn.out_proj"],
+        ["fc1"],
+        ["fc2"]
+    ]
+```
+After this, you can use `OPTGPTQForCausalLM.from_pretrained` and other methods as shown in the Quantization and Inference section.
+</details>
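The "chained attribute names" in `layers_block_name` and `outside_layer_modules` are dotted paths followed from the top-level model object. A minimal sketch of such a lookup, assuming nothing about AutoGPTQ internals: `resolve_chained_attr` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real `nn.Module` instances.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_chained_attr(root, dotted_name):
    """Follow a dotted path such as 'model.decoder.layers' one attribute at a time."""
    return reduce(getattr, dotted_name.split("."), root)


# Toy stand-in for an OPT-like module tree (strings replace real layers).
layer = SimpleNamespace(fc1="fc1-linear", fc2="fc2-linear")
toy_model = SimpleNamespace(
    model=SimpleNamespace(
        decoder=SimpleNamespace(layers=[layer, layer], embed_tokens="embedding")
    )
)

layers = resolve_chained_attr(toy_model, "model.decoder.layers")
print(len(layers))
print(resolve_chained_attr(layers[0], "fc1"))
```

Names in `inside_layer_modules` such as `self_attn.k_proj` would be resolved the same way, but starting from a single transformer layer rather than from the whole model.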
+### Evaluation on Downstream Tasks
+You can use the tasks defined in `auto_gptq.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
+The predefined tasks support all causal-language-models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.
+<details>
+<summary>Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>
+```python
+from functools import partial
+import datasets
+from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+from auto_gptq.eval_tasks import SequenceClassificationTask
+MODEL = "EleutherAI/gpt-j-6b"
+DATASET = "cardiffnlp/tweet_sentiment_multilingual"
+TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
+ID2LABEL = {
+    0: "negative",
+    1: "neutral",
+    2: "positive"
+}
+LABELS = list(ID2LABEL.values())
+def ds_refactor_fn(samples):
+    text_data = samples["text"]
+    label_data = samples["label"]
+    new_samples = {"prompt": [], "label": []}
+    for text, label in zip(text_data, label_data):
+        prompt = TEMPLATE.format(labels=LABELS, text=text)
+        new_samples["prompt"].append(prompt)
+        new_samples["label"].append(ID2LABEL[label])
+    return new_samples
+#  model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
+model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
+tokenizer = AutoTokenizer.from_pretrained(MODEL)
+task = SequenceClassificationTask(
+        model=model,
+        tokenizer=tokenizer,
+        classes=LABELS,
+        data_name_or_path=DATASET,
+        prompt_col_name="prompt",
+        label_col_name="label",
+        **{
+            "num_samples": 1000,  # how many samples to draw for evaluation
+            "sample_max_len": 1024,  # max tokens for each sample
+            "block_max_len": 2048,  # max tokens for each data block
+            # function to load the dataset; it must accept only data_name_or_path as input
+            # and return a datasets.Dataset
+            "load_fn": partial(datasets.load_dataset, name="english"),
+            # function to preprocess dataset, which is used for datasets.Dataset.map,
+            # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
+            "preprocess_fn": ds_refactor_fn,
+            # truncate the label when a sample's length exceeds sample_max_len
+            "truncate_prompt": False
+        }
+    )
+# note that max_new_tokens will be automatically specified internally based on given classes
+print(task.run())
+# self-consistency
+print(
+    task.run(
+        generation_config=GenerationConfig(
+            num_beams=3,
+            num_return_sequences=3,
+            do_sample=True
+        )
+    )
+)
+```
+</details>
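To see what the preprocessing function in the example produces, it can be run on its own with a tiny hand-made batch. The sketch below repeats `ds_refactor_fn` and its constants from the example; the two-row `batch` is made up for illustration and assumes the columnar dict-of-lists layout that the function's loop implies.

```python
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    """Turn raw text/label columns into prompt/label columns."""
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(samples["text"], samples["label"]):
        new_samples["prompt"].append(TEMPLATE.format(labels=LABELS, text=text))
        new_samples["label"].append(ID2LABEL[label])
    return new_samples


# Made-up rows standing in for a slice of the real dataset.
batch = {"text": ["I love this!", "It is fine."], "label": [2, 1]}
out = ds_refactor_fn(batch)
print(out["prompt"][0])
print(out["label"])
```

The returned dictionary has exactly the two keys named by `prompt_col_name` and `label_col_name`, which is what the comment on `preprocess_fn` requires.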
+## Learn More
+The [tutorials](docs/tutorial) provide step-by-step guidance on integrating `auto_gptq` into your own project, along with some best practices.
+The [examples](examples/README.md) provide plenty of scripts showing how to use `auto_gptq` in different ways.
+## Supported Models
+> You can compare `model.config.model_type` with the table below to check whether the model you use is supported by `auto_gptq`.
+>
+> For example, the model_type of `WizardLM`, `vicuna` and `gpt4all` is `llama`, hence they are all supported by `auto_gptq`.
+| model type                         | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt                                                                            |
+|------------------------------------|--------------|-----------|-----------|---------------|-------------------------------------------------------------------------------------------------|
+| bloom                              | ✅            | ✅         | ✅         | ✅             |                                                                                                 |
+| gpt2                               | ✅            | ✅         | ✅         | ✅             |                                                                                                 |
+| gpt_neox                           | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
+| gptj                               | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
+| llama                              | ✅            | ✅         | ✅         | ✅             | ✅                                                                                               |
+| moss                               | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
+| opt                                | ✅            | ✅         | ✅         | ✅             |                                                                                                 |
+| gpt_bigcode                        | ✅            | ✅         | ✅         | ✅             |                                                                                                 |
+| codegen                            | ✅            | ✅         | ✅         | ✅             |                                                                                                 |
+| falcon(RefinedWebModel/RefinedWeb) | ✅            | ✅         | ✅         | ✅             |                                                                                                 |
+## Supported Evaluation Tasks
+Currently, `auto_gptq` supports: `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more Tasks will come soon!
+## Running tests
+Tests can be run with:
+```
+pytest tests/ -s
+```
+## FAQ
+### Which kernel is used by default?
+AutoGPTQ defaults to using the exllamav2 int4*fp16 kernel for matrix multiplication.
+### How to use Marlin kernel?
+Marlin is an optimized int4 * fp16 kernel recently proposed at https://github.com/IST-DASLab/marlin. It is used in AutoGPTQ when loading a model with `use_marlin=True`. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
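The compute capability restriction can be checked before asking for the Marlin kernel. This is a minimal sketch of such a guard: `marlin_supported` and `MARLIN_CAPABILITIES` are hypothetical names introduced here, not part of AutoGPTQ, and the allowed set simply restates the two capabilities named above.

```python
from typing import Tuple

# Compute capabilities listed in the FAQ answer above: 8.0 and 8.6.
MARLIN_CAPABILITIES = {(8, 0), (8, 6)}


def marlin_supported(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is in the allowed set."""
    return tuple(capability) in MARLIN_CAPABILITIES


print(marlin_supported((8, 6)))
print(marlin_supported((7, 5)))
```

On a CUDA machine the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`, and the result used to decide whether to pass `use_marlin=True` to `from_quantized`.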
+## Acknowledgement
+- Special thanks to **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh** for proposing the **GPTQ** algorithm and open-sourcing the [code](https://github.com/IST-DASLab/gptq), and for releasing the [Marlin kernel](https://github.com/IST-DASLab/marlin) for mixed precision computation.
+- Special thanks to **qwopqwop200**: the quantization-related code in this project is mainly based on [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
+- Special thanks to **turboderp**, for releasing [Exllama](https://github.com/turboderp/exllama) and [Exllama v2](https://github.com/turboderp/exllamav2) libraries with efficient mixed precision kernels.
--- a/README_zh.md
+++ b/README_zh.md
+<h1 align="center">AutoGPTQ</h1>
+<p align="center">An easy-to-use LLM quantization toolkit with user-friendly APIs, based on the GPTQ algorithm.</p>
+<p align="center">
+    <a href="https://github.com/PanQiWei/AutoGPTQ/releases">
+        <img alt="GitHub release" src="https://img.shields.io/github/release/PanQiWei/AutoGPTQ.svg">
+    </a>
+    <a href="https://pypi.org/project/auto-gptq/">
+        <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/auto-gptq">
+    </a>
+</p>
+<h4 align="center">
+    <p>
+        <a href="https://github.com/PanQiWei/AutoGPTQ/blob/main/README.md">English</a> |
+        <b>中文</b>
+    </p>
+</h4>
+Note: The English README is likely to be more up to date.
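+## The Road to v1.0.0
+Hi, fellow community members, long time no see! I'm sorry that, for personal reasons, I have not been able to update this project very often recently. The past few weeks have been very significant for my career planning. Not long ago I officially said goodbye to the startup team I joined right after graduation and stayed with for two years. I am very grateful to the team's leaders and colleagues for their trust and guidance, which allowed me to grow rapidly over those two years; I am also very thankful that the team let me use its internal A100 GPU server cluster free of charge, ever since the AutoGPTQ project was founded, to complete all kinds of experiments and performance evaluations. (Of course, I can no longer use it from now on, so **I would be extremely grateful for any new hardware sponsorship**!) Over the past two years I worked on this team as an algorithm engineer, responsible for the architecture design and development of dialogue systems based on large language models. We successfully launched a product called gemsouls, but unfortunately it has ceased operation. Now the team is about to launch a new product called [modelize](https://www.beta.modelize.ai/), **an LLM-native AI agent platform where users can build a highly automated team out of multiple AI agents that cooperate with each other in workflows to complete complex projects efficiently.**
+Back to the topic: I am very excited to see that, over the past few months, research on optimizing LLM inference performance has made tremendous progress. Today we can run LLM inference not only on high-end GPUs but even on CPUs and edge devices with ease. This series of technical advances makes me just as eager to contribute more to the open-source community. So, first, I will spend about four weeks iterating AutoGPTQ to the official v1.0.0 release; during this period there will also be 2~3 minor releases so that users can try performance optimizations and new features promptly. In my vision, **by the time v1.0.0 is officially released, AutoGPTQ will be able to serve as a flexible, extensible quantization backend that supports all GPTQ-like methods and automatically quantizes large language models written in PyTorch**. I have described the development plan in detail [here](https://github.com/PanQiWei/AutoGPTQ/issues/348); you are welcome to head over there to discuss it and share your suggestions!
+## News or Update
+- 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated `auto-gptq`, so running inference and training with GPTQ models is now easier! Read [this blog](https://huggingface.co/blog/gptq-integration) and its related resources for more details!
+- 2023-08-21 - (News) - The Qwen team released a 4-bit quantized version of Qwen-7B based on `auto-gptq`, and provided [detailed evaluation results](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4#%E9%87%8F%E5%8C%96-quantization)
+- 2023-08-06 - (Update) - Support exllama's q4 CUDA kernel, giving int4 quantized models at least a 1.3x inference speedup.
+- 2023-08-04 - (Update) - Support RoCm so that AMD GPU users can use auto-gptq's CUDA extensions.
+- 2023-07-26 - (Update) - An elegant [PPL benchmark script](examples/benchmark/perplexity.py) for obtaining results that can be fairly compared with other libraries such as `llama.cpp`.
+- 2023-06-05 - (Update) - Integrate 🤗 peft to train adapters with GPTQ-quantized models; LoRA, AdaLoRA, AdaptionPrompt, etc. are supported.
+- 2023-05-30 - (Update) - Support downloading quantized models from the 🤗 Hub and uploading quantized models to the 🤗 Hub.
+*For more history, please see [here](docs/NEWS_OR_UPDATE.md)*
+## Performance Comparison
+### Inference Speed
+> The results below are generated with [this script](examples/benchmark/generation_speed.py); the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).
+>
+> The quantized model is loaded in the way that maximizes inference speed.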
+| model         | GPU           | num_beams | fp16  | gptq-int4 |
+|---------------|---------------|-----------|-------|-----------|
+| llama-7b      | 1xA100-40G    | 1         | 18.87 | 25.53     |
+| llama-7b      | 1xA100-40G    | 4         | 68.79 | 91.30     |
+| moss-moon 16b | 1xA100-40G    | 1         | 12.48 | 15.25     |
+| moss-moon 16b | 1xA100-40G    | 4         | OOM   | 42.67     |
+| moss-moon 16b | 2xA100-40G    | 1         | 06.83 | 06.78     |
+| moss-moon 16b | 2xA100-40G    | 4         | 13.10 | 10.80     |
+| gpt-j 6b      | 1xRTX3060-12G | 1         | OOM   | 29.55     |
+| gpt-j 6b      | 1xRTX3060-12G | 4         | OOM   | 47.36     |
+### Perplexity (PPL)
+For a perplexity comparison, you can refer to [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#result) and [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#gptq-vs-bitsandbytes).
+## Installation
+### Quick Installation
+You can use pip to install the pre-built wheels of the latest stable release of AutoGPTQ that is compatible with PyTorch 2.0.1:
+* For CUDA 11.7: `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/`
+* For CUDA 11.8: `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/`
+* For RoCm 5.4.2: `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm542/`
+**Warning:** The pre-built wheels do not necessarily work with PyTorch nightly builds. If you want to use a PyTorch nightly build, please install AutoGPTQ from source.
+#### Disable CUDA extension installation
+By default, the CUDA extensions are installed automatically when `torch` and `cuda` are already installed on your machine. If you do not want these extensions, use the following install command:
+```shell
+BUILD_CUDA_EXT=0 pip install auto-gptq
+```
+To also make sure the extension, `autogptq_cuda`, no longer exists in your virtual environment, run the following command:
+```shell
+pip uninstall autogptq_cuda -y
+```
+#### Use triton for acceleration
+To use `triton` to speed up model inference, use the following command:
+> Warning: currently triton only supports Linux; 3-bit quantization is not supported when using triton
+```shell
+pip install auto-gptq[triton]
+```
+### Install from source
+<details>
+<summary>Click to see details</summary>
+Clone the source code:
+```shell
+git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
+```
+Then, install from the project directory:
+```shell
+pip install .
+```
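+As in the Quick Installation section, you can use `BUILD_CUDA_EXT=0` to disable building the CUDA extensions.
+If you want to use triton acceleration and it is supported by your operating system, use `.[triton]`.
+For AMD GPUs, to install from source with RoCm support, please set the `ROCM_VERSION` environment variable. Compilation can also be sped up by setting `PYTORCH_ROCM_ARCH` ([reference](https://github.com/pytorch/pytorch/blob/7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba/setup.py#L132)); for example, the variable can be set to `gfx90a` for MI200 series devices. Example: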
+```
+ROCM_VERSION=5.6 pip install .
+```
+For RoCm systems, the following packages must additionally be installed before installing from source: `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev`.
+</details>
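+## Quick Start
+### Quantization and Inference
+> Warning: this only showcases the usage of the basic APIs in AutoGPTQ; it uses a single text sample to quantize a very small model, so the results may not be as good as expected from quantizing a large model.
+Below is the simplest usage of `auto_gptq` for quantization and inference: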
+```python
+from transformers import AutoTokenizer, TextGenerationPipeline
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+pretrained_model_dir = "facebook/opt-125m"
+quantized_model_dir = "opt-125m-4bit"
+tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
+examples = [
+    tokenizer(
+        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
+    )
+]
+quantize_config = BaseQuantizeConfig(
+    bits=4,  # quantize the model to 4-bit
+    group_size=128,  # it is generally recommended to set this value to 128
+    desc_act=False,  # setting this to False can significantly speed up inference, but ppl may be slightly worse
+)
+# load the un-quantized model; by default, the model is always loaded into CPU memory
+model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
+# quantize the model; examples should be a List[Dict] whose dict keys are exactly input_ids and attention_mask
+model.quantize(examples)
+# save the quantized model
+model.save_quantized(quantized_model_dir)
+# save the quantized model using safetensors
+model.save_quantized(quantized_model_dir, use_safetensors=True)
+# push the quantized model directly to the Hugging Face Hub
+# when using use_auth_token=True, make sure you have logged in first via huggingface-cli login
+# or pass the account token explicitly with use_auth_token="hf_xxxxxxx"
+# (uncomment the following three lines to enable this feature)
+# repo_id = f"YourUserName/{quantized_model_dir}"
+# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
+# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
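+# alternatively, you can save the quantized model locally and push it to the Hugging Face Hub at the same time
+# (uncomment the following three lines to enable this feature)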
+# repo_id = f"YourUserName/{quantized_model_dir}"
+# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
+# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)
+# load the quantized model onto the first visible GPU
+model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
+# download the quantized model from the Hugging Face Hub and load it onto the first visible GPU
+# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)
+# run inference with model.generate
+print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
+# or use TextGenerationPipeline
+pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
+print(pipeline("auto-gptq is")[0]["generated_text"])
+```
+See [this example script](examples/quantization/quant_with_alpaca.py) for more advanced usage.
+### Customize Model
+<details>
+<summary>Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:</summary>
+```python
+from auto_gptq.modeling import BaseGPTQForCausalLM
+class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
+    # chained attribute name of transformer layer block
+    layers_block_name = "model.decoder.layers"
+    # chained attribute names of other nn modules that are at the same level as the transformer layer block
+    outside_layer_modules = [
+        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
+        "model.decoder.project_in", "model.decoder.final_layer_norm"
+    ]
+    # chained attribute names of linear layers in transformer layer module
+    # normally there are four sub-lists; the modules in each sub-list can be seen as one operation,
+    # and the sub-lists should be ordered as they are actually executed; in this case (and in most cases),
+    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
+    inside_layer_modules = [
+        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
+        ["self_attn.out_proj"],
+        ["fc1"],
+        ["fc2"]
+    ]
+```
+After this, you can use `OPTGPTQForCausalLM.from_pretrained` and other methods as shown in the basic usage section.
+</details>
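+### Evaluation on Downstream Tasks
+You can use the tasks defined in `auto_gptq.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
+These predefined tasks support all causal-language-models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.
+<details>
+<summary>Below is an example of evaluating the `EleutherAI/gpt-j-6b` model on a sequence-classification (text classification) task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>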
+```python
+from functools import partial
+import datasets
+from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+from auto_gptq.eval_tasks import SequenceClassificationTask
+MODEL = "EleutherAI/gpt-j-6b"
+DATASET = "cardiffnlp/tweet_sentiment_multilingual"
+TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
+ID2LABEL = {
+    0: "negative",
+    1: "neutral",
+    2: "positive"
+}
+LABELS = list(ID2LABEL.values())
+def ds_refactor_fn(samples):
+    text_data = samples["text"]
+    label_data = samples["label"]
+    new_samples = {"prompt": [], "label": []}
+    for text, label in zip(text_data, label_data):
+        prompt = TEMPLATE.format(labels=LABELS, text=text)
+        new_samples["prompt"].append(prompt)
+        new_samples["label"].append(ID2LABEL[label])
+    return new_samples
+#  model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
+model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
+tokenizer = AutoTokenizer.from_pretrained(MODEL)
+task = SequenceClassificationTask(
+        model=model,
+        tokenizer=tokenizer,
+        classes=LABELS,
+        data_name_or_path=DATASET,
+        prompt_col_name="prompt",
+        label_col_name="label",
+        **{
+            "num_samples": 1000,  # how many samples to draw for evaluation
+            "sample_max_len": 1024,  # max tokens for each sample
+            "block_max_len": 2048,  # max tokens for each data block
+            # function to load the dataset; it must accept only data_name_or_path as input
+            # and return a datasets.Dataset
+            "load_fn": partial(datasets.load_dataset, name="english"),  
+            # function to preprocess dataset, which is used for datasets.Dataset.map, 
+            # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
+            "preprocess_fn": ds_refactor_fn,  
+            # truncate the label when a sample's length exceeds sample_max_len
+            "truncate_prompt": False  
+        }
+    )
+# note that max_new_tokens will be automatically specified internally based on given classes
+print(task.run())
+# self-consistency
+print(
+    task.run(
+        generation_config=GenerationConfig(
+            num_beams=3,
+            num_return_sequences=3,
+            do_sample=True
+        )
+    )
+)
+```
+</details>
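+## Learn More
+The [tutorials](docs/tutorial) provide step-by-step guidance and best practices for integrating `auto_gptq` into your project.
+The [examples](examples/README.md) provide plenty of example scripts for using `auto_gptq` in different scenarios.
+## Supported Models
+> You can compare `model.config.model_type` with the table below to check whether the model you are using is supported by `auto_gptq`.
+>
+> For example, the `model_type` of `WizardLM`, `vicuna` and `gpt4all` is `llama`, so these models are all supported by `auto_gptq`.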
+| model type                         | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt                                                              |
+|------------------------------------|--------------|-----------|-----------|---------------|-----------------------------------------------------------------------------------|
+| bloom                              | ✅            | ✅         | ✅         | ✅             |                                                                                   |
+| gpt2                               | ✅            | ✅         | ✅         | ✅             |                                                                                   |
+| gpt_neox                           | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
+| gptj                               | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
+| llama                              | ✅            | ✅         | ✅         | ✅             | ✅                                                                                 |
+| moss                               | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
+| opt                                | ✅            | ✅         | ✅         | ✅             |                                                                                   |
+| gpt_bigcode                        | ✅            | ✅         | ✅         | ✅             |                                                                                   |
+| codegen                            | ✅            | ✅         | ✅         | ✅             |                                                                                   |
+| falcon(RefinedWebModel/RefinedWeb) | ✅            | ✅         | ✅         | ✅             |                                                                                   |
+## Supported Evaluation Tasks
+Currently, `auto_gptq` supports the following evaluation tasks: `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more evaluation tasks are on the way!
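As a rough illustration of what these tasks report, the sketch below mirrors the metric logic found in the `eval_tasks` sources of this commit: perplexity as the exponential of the mean per-batch loss, and classification accuracy after mapping generated text to the closest class label by edit distance. The function names (`perplexity`, `edit_distance`, `closest_label`, `accuracy`) are illustrative assumptions and are not part of the `auto_gptq` API.

```python
import math
from typing import List, Sequence


def perplexity(batch_losses: List[float]) -> float:
    """Exponential of the mean per-batch loss (the "ppl" style metric)."""
    return math.exp(sum(batch_losses) / len(batch_losses))


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def closest_label(text: str, classes: List[str]) -> int:
    """Index of the class whose name is nearest to the normalized text."""
    normalized = text.lower().strip()
    distances = [edit_distance(normalized, c) for c in classes]
    # On ties, the first class with the minimum distance wins.
    return distances.index(min(distances))


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match their labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


# Hypothetical generations mapped onto two hypothetical class names.
classes = ["negative", "positive"]
preds = [closest_label(t, classes) for t in [" Positive ", "negativ", "positve"]]
```

Matching by edit distance lets slightly malformed generations (extra whitespace, a missing letter) still count toward a class instead of being discarded.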
+## Acknowledgement
+- Special thanks to **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh** for proposing the **GPTQ** algorithm and open-sourcing the [code](https://github.com/IST-DASLab/gptq).
+- Special thanks to **qwopqwop200**; the model quantization code in this project is mainly based on [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
+[![Star History Chart](https://api.star-history.com/svg?repos=PanQiwei/AutoGPTQ&type=Date)](https://star-history.com/#PanQiWei/AutoGPTQ&Date)
--- a/auto_gptq/__init__.py
+++ b/auto_gptq/__init__.py
+from .modeling import AutoGPTQForCausalLM, BaseQuantizeConfig
+from .utils.exllama_utils import exllama_set_max_input_length
+from .utils.peft_utils import get_gptq_peft_model
+__version__ = "0.8.0.dev0"
--- a/auto_gptq/__pycache__/__init__.cpython-310.pyc
+++ b/auto_gptq/__pycache__/__init__.cpython-310.pyc
--- a/auto_gptq/eval_tasks/__init__.py
+++ b/auto_gptq/eval_tasks/__init__.py
+from .language_modeling_task import LanguageModelingTask
+from .sequence_classification_task import SequenceClassificationTask, get_predictions
+from .text_summarization_task import TextSummarizationTask
--- a/auto_gptq/eval_tasks/_base.py
+++ b/auto_gptq/eval_tasks/_base.py
+from abc import ABC, abstractmethod
+from typing import Any, Dict, List, Optional, Union
+import torch
+from transformers import PreTrainedModel, PreTrainedTokenizer
+from ..modeling import BaseGPTQForCausalLM
+from ..utils.data_utils import get_dataloader
+# Inherit from ABC so that @abstractmethod is actually enforced on subclasses.
+class BaseTask(ABC):
+    def __init__(
+        self,
+        model: Union[BaseGPTQForCausalLM, PreTrainedModel],
+        tokenizer: PreTrainedTokenizer,
+        data_name_or_path: str,
+        prompt_col_name: str,
+        label_col_name: str,
+        device: Optional[str] = None,
+        **kwargs,
+    ):
+        self.model = model
+        self.tokenizer = tokenizer
+        if self.tokenizer.pad_token_id is None:
+            self.tokenizer.pad_token = self.tokenizer.eos_token
+            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
+            self.model.config.pad_token_id = self.tokenizer.eos_token_id
+        self.dl = get_dataloader(
+            data_name_or_path,
+            prompt_col_name=prompt_col_name,
+            label_col_name=label_col_name,
+            tokenizer=tokenizer,
+            **kwargs,
+        )
+        self.device = device
+        if not self.device:
+            self.device = self.model.device
+        if isinstance(self.device, str):
+            self.device = torch.device(self.device)
+    @abstractmethod
+    def _predict(self, batch_data: Dict[str, Any], **kwargs) -> List[Any]:
+        pass
+    @abstractmethod
+    def _parse_labels(self, label_ids: torch.LongTensor) -> List[Any]:
+        pass
+    @abstractmethod
+    def _metric(self, pred: List[Any], label: List[Any]) -> Dict[str, float]:
+        pass
+    def run(self, **predict_kwargs) -> Dict[str, float]:
+        with torch.inference_mode(), torch.amp.autocast(device_type=self.device.type):
+            predictions = []
+            labels = []
+            for batch_data in self.dl:
+                for k, v in batch_data.items():
+                    if isinstance(v, torch.Tensor):
+                        batch_data[k] = v.to(self.device)
+                labels += self._parse_labels(batch_data["labels"])
+                predictions += self._predict(batch_data, **predict_kwargs)
+        return self._metric(predictions, labels)
--- a/auto_gptq/eval_tasks/_utils/__init__.py
+++ b/auto_gptq/eval_tasks/_utils/__init__.py
--- a/auto_gptq/eval_tasks/_utils/classification_utils.py
+++ b/auto_gptq/eval_tasks/_utils/classification_utils.py
+import sys
+from typing import List, Sequence
+import numpy as np
+def levenshtein_distance(seq1: Sequence, seq2: Sequence):
+    if seq1 == seq2:
+        return 0
+    num_rows = len(seq1) + 1
+    num_cols = len(seq2) + 1
+    dp_matrix = np.empty((num_rows, num_cols), dtype=int)
+    dp_matrix[0, :] = range(num_cols)
+    dp_matrix[:, 0] = range(num_rows)
+    for i in range(1, num_rows):
+        for j in range(1, num_cols):
+            if seq1[i - 1] == seq2[j - 1]:
+                dp_matrix[i, j] = dp_matrix[i - 1, j - 1]
+            else:
+                dp_matrix[i, j] = (
+                    min(
+                        dp_matrix[i - 1, j - 1],
+                        dp_matrix[i - 1, j],
+                        dp_matrix[i, j - 1],
+                    )
+                    + 1
+                )
+    return int(dp_matrix[num_rows - 1, num_cols - 1])
+def get_closest_label(pred: Sequence, classes: List[Sequence]) -> int:
+    min_id = sys.maxsize
+    min_edit_distance = sys.maxsize
+    for i, class_label in enumerate(classes):
+        edit_distance = levenshtein_distance(pred, class_label)
+        if edit_distance < min_edit_distance:
+            min_id = i
+            min_edit_distance = edit_distance
+    return min_id
+__all__ = ["levenshtein_distance", "get_closest_label"]
--- a/auto_gptq/eval_tasks/_utils/generation_utils.py
+++ b/auto_gptq/eval_tasks/_utils/generation_utils.py
+from typing import List, Optional, Union
+from torch import LongTensor
+from transformers import PreTrainedTokenizer
+def postprocess_generation_ids(
+    input_ids: LongTensor,
+    output_ids: LongTensor,
+    num_return_sequences: int,
+    tokenizer: Optional[PreTrainedTokenizer] = None,
+    pad_token_ids: Optional[int] = None,
+) -> List[List[Union[str, List[int]]]]:
+    outputs = []
+    for idx, start in enumerate(range(0, len(output_ids), num_return_sequences)):
+        sub_output_ids = output_ids[start : start + num_return_sequences]
+        sub_generated_ids = sub_output_ids[..., input_ids[idx].size(0) :]
+        if tokenizer:
+            # Decode only the newly generated tokens into text.
+            decoded_batch = tokenizer.batch_decode(sub_generated_ids, clean_up_tokenization_spaces=True)
+            outputs.append(decoded_batch)
+        else:
+            # Convert the prompt-stripped ids (not the full outputs) to plain lists.
+            sub_generated_ids = sub_generated_ids.cpu().numpy().tolist()
+            for i, one_sub_generated_ids in enumerate(sub_generated_ids):
+                if pad_token_ids is not None and pad_token_ids in one_sub_generated_ids:
+                    one_sub_generated_ids = one_sub_generated_ids[: one_sub_generated_ids.index(pad_token_ids)]
+                sub_generated_ids[i] = one_sub_generated_ids
+            outputs.append(sub_generated_ids)
+    return outputs
+__all__ = ["postprocess_generation_ids"]
--- a/auto_gptq/eval_tasks/language_modeling_task.py
+++ b/auto_gptq/eval_tasks/language_modeling_task.py
+import math
+from typing import Any, Dict, List, Optional
+from torch import LongTensor
+from ._base import BaseTask
+class LanguageModelingTask(BaseTask):
+    def __init__(
+        self,
+        model,
+        tokenizer,
+        data_name_or_path: str,
+        prompt_col_name: str,
+        label_col_name: str,
+        device: Optional[str] = None,
+        **kwargs,
+    ):
+        kwargs["merge_prompt_label"] = True
+        super().__init__(
+            model=model,
+            tokenizer=tokenizer,
+            data_name_or_path=data_name_or_path,
+            prompt_col_name=prompt_col_name,
+            label_col_name=label_col_name,
+            device=device,
+            **kwargs,
+        )
+    def _predict(self, batch_data: Dict[str, Any], *args, **kwargs) -> List[float]:
+        outputs = self.model(**batch_data)
+        loss = outputs.loss.cpu().item()
+        return [loss]
+    def _parse_labels(self, label_ids: LongTensor) -> List[Any]:
+        return []
+    def _metric(self, pred: List[Any], label: List[Any]) -> Dict[str, float]:
+        return {"ppl": math.exp(sum(pred) / len(pred))}
+    def run(self) -> Dict[str, float]:
+        return super().run()
+__all__ = ["LanguageModelingTask"]
--- a/auto_gptq/eval_tasks/sequence_classification_task.py
+++ b/auto_gptq/eval_tasks/sequence_classification_task.py
+from collections import Counter
+from typing import Any, Dict, List, Optional
+import numpy as np
+from torch import LongTensor
+from transformers import GenerationConfig, PreTrainedTokenizer
+from ._base import BaseTask
+from ._utils.classification_utils import get_closest_label
+from ._utils.generation_utils import postprocess_generation_ids
+def get_predictions(
+    input_ids: LongTensor,
+    output_ids: LongTensor,
+    num_return_sequences: int,
+    tokenizer: PreTrainedTokenizer,
+    classes: List[str],
+) -> List[int]:
+    predictions = []
+    generated_texts = postprocess_generation_ids(
+        input_ids=input_ids,
+        output_ids=output_ids,
+        num_return_sequences=num_return_sequences,
+        tokenizer=tokenizer,
+    )
+    for sub_generated_texts in generated_texts:
+        sub_predictions = []
+        for gen_text in sub_generated_texts:
+            sub_predictions.append(get_closest_label(gen_text.lower().strip(), classes))
+        predictions.append(Counter(sub_predictions).most_common(1)[0][0])
+    return predictions
+class SequenceClassificationTask(BaseTask):
+    def __init__(
+        self,
+        model,
+        tokenizer: PreTrainedTokenizer,
+        classes: List[str],
+        data_name_or_path: str,
+        prompt_col_name: str,
+        label_col_name: str,
+        device: Optional[str] = None,
+        **kwargs,
+    ):
+        kwargs["merge_prompt_label"] = False
+        super().__init__(
+            model=model,
+            tokenizer=tokenizer,
+            data_name_or_path=data_name_or_path,
+            prompt_col_name=prompt_col_name,
+            label_col_name=label_col_name,
+            device=device,
+            **kwargs,
+        )
+        self.classes = [each.lower().strip() for each in classes]
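+        # Iterating a tokenizer's BatchEncoding yields its keys, so index "input_ids" explicitly.
+        classes_ids = self.tokenizer(classes)["input_ids"]
+        self.max_new_tokens = max(len(each) for each in classes_ids)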
+    def _predict(self, batch_data: Dict[str, Any], *args, **kwargs) -> List[int]:
+        generation_config = kwargs["generation_config"]
+        output_ids = self.model.generate(
+            input_ids=batch_data["input_ids"],
+            attention_mask=batch_data["attention_mask"],
+            generation_config=generation_config,
+        )
+        return get_predictions(
+            batch_data["input_ids"],
+            output_ids,
+            generation_config.num_return_sequences,
+            self.tokenizer,
+            self.classes,
+        )
+    def _parse_labels(self, label_ids: LongTensor) -> List[int]:
+        labels = []
+        for one_label_ids in label_ids:
+            one_label_ids = one_label_ids[(one_label_ids == -100).sum() :]
+            label = self.tokenizer.decode(one_label_ids, clean_up_tokenization_spaces=True).lower().strip()
+            label = get_closest_label(label, self.classes)
+            labels.append(label)
+        return labels
+    def _metric(self, pred: List[int], label: List[int]) -> Dict[str, float]:
+        pred = np.array(pred)
+        label = np.array(label)
+        acc = (pred == label).mean()
+        return {"acc": acc}
+    def run(self, generation_config: Optional[GenerationConfig] = None) -> Dict[str, float]:
+        if not generation_config:
+            generation_config = GenerationConfig(num_beams=1, do_sample=False, num_return_sequences=1)
+        generation_config.max_new_tokens = self.max_new_tokens
+        generation_config.eos_token_id = self.tokenizer.eos_token_id
+        generation_config.pad_token_id = self.tokenizer.pad_token_id
+        return super().run(generation_config=generation_config)
+__all__ = ["SequenceClassificationTask"]
--- a/auto_gptq/eval_tasks/text_summarization_task.py
+++ b/auto_gptq/eval_tasks/text_summarization_task.py
+from typing import Any, Dict, List, Optional
+import rouge
+from torch import LongTensor
+from transformers import GenerationConfig
+from ._base import BaseTask
+from ._utils.generation_utils import postprocess_generation_ids
+class TextSummarizationTask(BaseTask):
+    def __init__(
+        self,
+        model,
+        tokenizer,
+        data_name_or_path: str,
+        prompt_col_name: str,
+        label_col_name: str,
+        device: Optional[str] = None,
+        **kwargs,
+    ):
+        kwargs["merge_prompt_label"] = False
+        super().__init__(
+            model=model,
+            tokenizer=tokenizer,
+            data_name_or_path=data_name_or_path,
+            prompt_col_name=prompt_col_name,
+            label_col_name=label_col_name,
+            device=device,
+            **kwargs,
+        )
+    def _predict(self, batch_data: Dict[str, Any], *args, **kwargs) -> List[str]:
+        generation_config = kwargs["generation_config"]
+        output_ids = self.model.generate(
+            input_ids=batch_data["input_ids"],
+            attention_mask=batch_data["attention_mask"],
+            generation_config=generation_config,
+        )
+        return [
+            each[0].lower().strip()
+            for each in postprocess_generation_ids(
+                input_ids=batch_data["input_ids"],
+                output_ids=output_ids,
+                num_return_sequences=generation_config.num_return_sequences,
+                tokenizer=self.tokenizer,
+            )
+        ]
+    def _parse_labels(self, label_ids: LongTensor) -> List[str]:
+        labels = []
+        for one_label_ids in label_ids:
+            one_label_ids = one_label_ids[(one_label_ids == -100).sum() :]
+            label = self.tokenizer.decode(one_label_ids).lower().strip()
+            labels.append(label)
+        return labels
+    def _metric(self, pred: List[Any], label: List[Any]) -> Dict[str, Dict[str, float]]:
+        metric = rouge.Rouge()
+        return metric.get_scores(hyps=pred, refs=label, avg=True)
+    def run(self, generation_config: Optional[GenerationConfig] = None) -> Dict[str, float]:
+        if not generation_config:
+            generation_config = GenerationConfig(num_beams=1, do_sample=False, max_new_tokens=128)
+        generation_config.num_return_sequences = 1
+        generation_config.eos_token_id = self.tokenizer.eos_token_id
+        generation_config.pad_token_id = self.tokenizer.pad_token_id
+        return super().run(generation_config=generation_config)
+__all__ = ["TextSummarizationTask"]
--- a/auto_gptq/modeling/__init__.py
+++ b/auto_gptq/modeling/__init__.py
+from ._base import BaseGPTQForCausalLM, BaseQuantizeConfig
+from .auto import GPTQ_CAUSAL_LM_MODEL_MAP, AutoGPTQForCausalLM
+from .baichuan import BaiChuanGPTQForCausalLM
+from .bloom import BloomGPTQForCausalLM
+from .codegen import CodeGenGPTQForCausalLM
+from .cohere import CohereGPTQForCausalLM
+from .decilm import DeciLMGPTQForCausalLM
+from .gemma import GemmaGPTQForCausalLM
+from .gemma2 import Gemma2GPTQForCausalLM
+from .gpt2 import GPT2GPTQForCausalLM
+from .gpt_bigcode import GPTBigCodeGPTQForCausalLM
+from .gpt_neox import GPTNeoXGPTQForCausalLM
+from .gptj import GPTJGPTQForCausalLM
+from .internlm import InternLMGPTQForCausalLM
+from .llama import LlamaGPTQForCausalLM
+from .longllama import LongLlamaGPTQForCausalLM
+from .mistral import MistralGPTQForCausalLM
+from .mixtral import MixtralGPTQForCausalLM
+from .moss import MOSSGPTQForCausalLM
+from .mpt import MPTGPTQForCausalLM
+from .opt import OPTGPTQForCausalLM
+from .phi import PhiGPTQForCausalLM
+from .qwen import QwenGPTQForCausalLM
+from .qwen2 import Qwen2GPTQForCausalLM
+from .rw import RWGPTQForCausalLM
+from .stablelmepoch import StableLMEpochGPTQForCausalLM
+from .starcoder2 import Starcoder2GPTQForCausalLM
+from .xverse import XverseGPTQForCausalLM
+from .yi import YiGPTQForCausalLM
--- a/auto_gptq/modeling/__pycache__/__init__.cpython-310.pyc
+++ b/auto_gptq/modeling/__pycache__/__init__.cpython-310.pyc
--- a/auto_gptq/modeling/__pycache__/_base.cpython-310.pyc
+++ b/auto_gptq/modeling/__pycache__/_base.cpython-310.pyc
--- a/auto_gptq/modeling/__pycache__/_const.cpython-310.pyc
+++ b/auto_gptq/modeling/__pycache__/_const.cpython-310.pyc
--- a/auto_gptq/modeling/__pycache__/_utils.cpython-310.pyc
+++ b/auto_gptq/modeling/__pycache__/_utils.cpython-310.pyc
--- a/auto_gptq/modeling/__pycache__/auto.cpython-310.pyc
+++ b/auto_gptq/modeling/__pycache__/auto.cpython-310.pyc
--- a/auto_gptq/modeling/__pycache__/baichuan.cpython-310.pyc
+++ b/auto_gptq/modeling/__pycache__/baichuan.cpython-310.pyc
--- a/auto_gptq/modeling/__pycache__/bloom.cpython-310.pyc
+++ b/auto_gptq/modeling/__pycache__/bloom.cpython-310.pyc