AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.
*Latest News* 🔥
- [2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
...
## Install
### Prerequisites

- Your GPU(s) must be of Compute Capability 7.5 (sm75). Turing and later architectures are supported.
- Your CUDA version must be CUDA 11.8 or later. You can verify both with the snippet below.
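A quick way to check both, assuming you already have a CUDA-enabled PyTorch build installed (this check is only illustrative and is not part of AutoAWQ):

```
import torch

# Compute capability as a (major, minor) tuple, e.g. (7, 5) for Turing
print(torch.cuda.get_device_capability())

# CUDA version this PyTorch build was compiled against, e.g. '12.1'
print(torch.version.cuda)
```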
### Install from PyPi

To install the newest AutoAWQ from PyPi, you need CUDA 12.1 installed. The distributed wheels are built against torch 2.1.0 and CUDA 12.1.1.
```
pip install autoawq
```
### Install from a GitHub release

If you cannot use CUDA 12.1, you can still use CUDA 11.8 and install a wheel from the [latest release](https://github.com/casper-hansen/AutoAWQ/releases). The release wheels are built with torch 2.0.1 and CUDA 11.8.0; remember to grab the link that matches your environment.
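For example, a wheel for torch 2.0.1 with CUDA 11.8.0 and Python 3.10 on Linux might be installed like this (the filename below is illustrative; copy the real link from the releases page):

```
# Illustrative URL — take the actual link from the latest release page
pip install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
```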
The example below runs inference with a quantized model; the checkpoint `TheBloke/zephyr-7B-beta-AWQ` is used for illustration, and any AWQ-quantized model works the same way:

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load the quantized model and its tokenizer
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

prompt_template = """\
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>"""

prompt = "You're standing on the surface of the Earth. "\
         "You walk one mile south, one mile west and one mile north. "\
         "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt_template.format(prompt=prompt),
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(tokens, max_new_tokens=512)
print(tokenizer.decode(generation_output[0], skip_special_tokens=True))
```
The main arguments to `from_quantized` are:

- `quant_path`: Path to the folder containing the model files.
- `quant_filename`: The filename of the model weights or the `index.json` file.
- `max_new_tokens`: The max sequence length, used to allocate the kv-cache for fused models.
- `fuse_layers`: Whether or not to use fused layers.
- `batch_size`: The batch size to initialize the AWQ model with.
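As a sketch, loading a local quantized checkpoint with these options might look like the following; the folder and filename are hypothetical:

```
from awq import AutoAWQForCausalLM

# Hypothetical local folder produced by a previous quantization run
model = AutoAWQForCausalLM.from_quantized(
    "path/to/quantized-model",   # quant_path
    "model.safetensors",         # quant_filename
    max_new_tokens=4096,         # kv-cache allocation for fused modules
    fuse_layers=True,
    batch_size=1,
)
```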
</details>
## Benchmarks
These benchmarks showcase the speed and memory usage of processing context (prefill) and generating tokens (decoding). The results include speed at various batch sizes and with different versions of the AWQ kernels. We have aimed to test models fairly, using the same benchmarking tool that you can use to reproduce the results. Note that speed may vary not only between GPUs but also between CPUs: what matters most is a GPU with high memory bandwidth and a CPU with a high single-core clock speed.
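To reproduce the numbers, the repository ships a benchmarking script. The invocation below is only a sketch: the script path and `--model_path` flag are assumptions, so check the `examples/` directory in the repository for the exact interface.

```
# Illustrative invocation — verify the script name and flags against the repository
python examples/benchmark.py --model_path TheBloke/zephyr-7B-beta-AWQ
```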