# AutoAWQ

<p align="center">
| <a href="https://github.com/casper-hansen/AutoAWQ/issues/32"><b>Roadmap</b></a> | <a href="https://github.com/casper-hansen/AutoAWQ/tree/main/examples"><b>Examples</b></a> | <a href="https://github.com/casper-hansen/AutoAWQ/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22"><b>Issues: Help Wanted</b></a> |

</p>
<p align="center">
    <a href="https://huggingface.co/models?search=awq">
        <img alt="Huggingface - Models" src="https://img.shields.io/badge/🤗_1000+_models_available-8A2BE2">
    </a>
    <a href="https://github.com/casper-hansen/AutoAWQ/releases">
        <img alt="GitHub - Releases" src="https://img.shields.io/github/release/casper-hansen/AutoAWQ.svg">
    </a>
    <a href="https://pypi.org/project/autoawq/">
        <img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/autoawq/month">
    </a>
</p>

AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, AutoAWQ speeds up models by 3x and reduces memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. It was created from, and improves upon, the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.

*Latest News* 🔥
- [2023/12] Mixtral, LLaVA, Qwen, and Baichuan model support.
- [2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
- [2023/10] Mistral (fused modules), BigCode, and Turing support; memory bug fix (saves 2 GB of VRAM).
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
- [2023/08] PyPI package released and AutoModel class available.

## Install

### Prerequisites

- NVIDIA:
  - Your NVIDIA GPU(s) must have Compute Capability 7.5 or higher; Turing and later architectures are supported (see the check below).
  - Your CUDA version must be CUDA 11.8 or later.
- AMD:
  - Your ROCm version must be ROCm 5.6 or later.
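
You can verify your compute capability and toolkit version with a short PyTorch snippet (a sketch, assuming PyTorch is already installed):

```python
import torch

# Compute capability of every visible NVIDIA GPU (AutoAWQ needs 7.5 or higher)
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")

# CUDA version this PyTorch build targets (needs 11.8 or later); None on ROCm builds
print("CUDA:", torch.version.cuda)
# ROCm/HIP version on AMD builds (needs 5.6 or later); None on CUDA builds
print("ROCm:", getattr(torch.version, "hip", None))
```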

### Install from PyPI

To install the latest AutoAWQ release from PyPI, you need CUDA 12.1 installed.

```
pip install autoawq
```

### Install from release wheels or build from source

For CUDA 11.8, ROCm 5.6, and ROCm 5.7, you can install wheels from the [release page](https://github.com/casper-hansen/AutoAWQ/releases/latest):

```
pip install autoawq@https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl
```

Or from the main branch directly:

```
pip install autoawq@git+https://github.com/casper-hansen/AutoAWQ.git
```

Or by cloning the repository and installing from source:

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

All three methods install the latest kernels that match your system from [AutoAWQ_Kernels](https://github.com/casper-hansen/AutoAWQ_kernels/releases).

If your system is not supported (i.e. not on the release page), you can build the kernels yourself by following the instructions in [AutoAWQ_Kernels](https://github.com/casper-hansen/AutoAWQ_kernels/releases) and then install AutoAWQ from source.

## Usage

Under [examples](examples), you can find scripts that show how to quantize, run inference, and benchmark AutoAWQ models.

### INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of the AWQ kernels: GEMM and GEMV. Both names refer to how the underlying matrix multiplication is executed. We suggest the following (a short selection sketch follows this list):

- GEMV (quantized): about 20% faster than GEMM, but only supports batch size 1 (not good for large contexts).
- GEMM (quantized): Much faster than FP16 at batch sizes below 8 (good with large contexts).
- FP16 (non-quantized): Recommended for highest throughput: [vLLM](https://github.com/vllm-project/vllm).
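
The kernel version is chosen at quantization time through the `version` key of `quant_config` and is recorded in the saved checkpoint, so the right kernels are picked up at inference time. A minimal sketch of the two configurations (the keys match the Quantization example further down):

```python
# GEMM kernels: better for batched and large-context workloads
gemm_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# GEMV kernels: fastest for batch size 1 decoding with short contexts
gemv_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV"}

# Pass the chosen config to model.quantize(tokenizer, quant_config=...) exactly as
# shown in the Quantization example below.
```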

#### Compute-bound vs Memory-bound

At small batch sizes with small 7B models, we are memory-bound. This means we are bound by the bandwidth our GPU has to push around the weights in memory, and this is essentially what limits how many tokens per second we can generate. Being memory-bound is what makes quantized models faster because your weights are 3x smaller and can therefore be pushed around in memory much faster. This is different from being compute-bound where the main time spent during generation is doing matrix multiplication. 

In the compute-bound scenario, which occurs at higher batch sizes, you will not gain a speed-up from a W4A16 quantized model, because the overhead of dequantization slows down the overall generation. AWQ-quantized models store the weights in INT4 but perform the actual operations in FP16, so the weights are effectively converted from INT4 to FP16 on the fly during inference.
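
As a rough, illustrative back-of-the-envelope calculation of the memory-bound ceiling (the bandwidth figure below is an assumed round number, not a measurement):

```python
# Decode-speed ceiling for a memory-bound 7B model: every generated token requires
# reading (at least) the full set of weights from GPU memory once.
params = 7e9
bandwidth_gb_s = 1000  # assumed GPU memory bandwidth in GB/s (roughly RTX 4090 class)

fp16_gb = params * 2 / 1e9    # 2 bytes per weight   -> ~14 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per weight -> ~3.5 GB (before scales/zeros overhead)

print(f"FP16 ceiling: ~{bandwidth_gb_s / fp16_gb:.0f} tokens/s")  # ~71 tokens/s
print(f"INT4 ceiling: ~{bandwidth_gb_s / int4_gb:.0f} tokens/s")  # ~286 tokens/s
# With quantization metadata included, the real size reduction is closer to the ~3x
# quoted above, but the principle is the same: smaller weights mean a higher
# memory-bound decode ceiling.
```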

### Fused modules

Fused modules are a large part of the speedup you get from AutoAWQ. The idea is to combine multiple layers into a single operation, making them more efficient. Fused modules are a set of custom modules that work separately from Huggingface models. They remain compatible with `model.generate()` and other Huggingface methods, but activating them comes with some inflexibility in how you can use your model (a loading sketch follows this list):

- Fused modules are activated when you use `fuse_layers=True`.
- A custom cache is implemented. It preallocates based on batch size and sequence length.
    - You cannot change the sequence length after you have created your model.
    - Reference: `AutoAWQForCausalLM.from_quantized(max_seq_len=seq_len, batch_size=batch_size)`
- The main accelerator in the fused modules comes from FasterTransformer, which is only compatible with Linux.
- The `past_key_values` from `model.generate()` are only dummy values, so they cannot be used after generation.
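
A minimal loading sketch with fused modules, using the parameters referenced above (the checkpoint name is taken from the inference example below; adjust `max_seq_len` and `batch_size` to your workload):

```python
from awq import AutoAWQForCausalLM

# fuse_layers=True enables the fused modules; the custom cache is preallocated
# for batch_size x max_seq_len and cannot be resized after loading.
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/zephyr-7B-beta-AWQ",
    fuse_layers=True,
    max_seq_len=2048,
    batch_size=1,
)
```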

### Examples

More examples can be found in the [examples directory](examples).

<details>

<summary>Quantization</summary>

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
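
# quant_config fields:
#   zero_point   - use zero-point (asymmetric) quantization
#   q_group_size - number of weights that share one scale/zero point
#   w_bit        - weight bit-width (4-bit)
#   version      - kernel layout to target, "GEMM" or "GEMV" (see the section above)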
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

</details>

<details>

<summary>Inference</summary>

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt_template.format(prompt=prompt), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)
```

</details>
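
Because AutoAWQ inference is integrated into 🤗 transformers (see the news above), AWQ checkpoints can also be loaded without importing `awq` directly. A minimal sketch, assuming a recent transformers release with AWQ support, `autoawq` installed as the backend, and `accelerate` available for `device_map="auto"`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "TheBloke/zephyr-7B-beta-AWQ"

model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("What is AWQ quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this path loads the model without fused modules; when you want `fuse_layers=True`, load through `AutoAWQForCausalLM` as in the example above.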

## Benchmarks

These benchmarks showcase the speed and memory usage of processing context (prefill) and generating tokens (decoding). The results include speed at various batch sizes and with different versions of the AWQ kernels. We have aimed to test models fairly using the same benchmarking tool, which you can use to reproduce the results. Note that speed may vary not only between GPUs but also between CPUs; what matters most is a GPU with high memory bandwidth and a CPU with a high single-core clock speed.

- Tested with AutoAWQ version 0.1.6
- GPU: RTX 4090 (CPU: AMD Ryzen 9 7950X)
- Command: `python examples/benchmark.py --model_path <hf_model> --batch_size 1`
- 🟢 for GEMV, 🔵 for GEMM, 🔴 for configurations to avoid

| Model Name | Size | Version | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)     |
| ---------- | ---- | ------- | ---------- | -------------- | ------------- | ---------------- | --------------- | ----------------- |
| Vicuna     | 7B   | 🟢GEMV   | 1          | 64             | 64            | 639.65           | 198.848         | 4.50 GB (19.05%)  |
| Vicuna     | 7B   | 🟢GEMV   | 1          | 2048           | 2048          | 1123.63          | 133.191         | 6.15 GB (26.02%)  |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| Mistral    | 7B   | 🔵GEMM   | 1          | 64             | 64            | 1093.35          | 156.317         | 4.35 GB (18.41%)  |
| Mistral    | 7B   | 🔵GEMM   | 1          | 2048           | 2048          | 3897.02          | 114.355         | 5.55 GB (23.48%)  |
| Mistral    | 7B   | 🔵GEMM   | 8          | 64             | 64            | 4199.18          | 1185.25         | 4.35 GB (18.41%)  |
| Mistral    | 7B   | 🔵GEMM   | 8          | 2048           | 2048          | 3661.46          | 829.754         | 16.82 GB (71.12%) |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| Mistral    | 7B   | 🟢GEMV   | 1          | 64             | 64            | 531.99           | 188.29          | 4.28 GB (18.08%)  |
| Mistral    | 7B   | 🟢GEMV   | 1          | 2048           | 2048          | 903.83           | 130.66          | 5.55 GB (23.48%)  |
| Mistral    | 7B   | 🔴GEMV   | 8          | 64             | 64            | 897.87           | 486.46          | 4.33 GB (18.31%)  |
| Mistral    | 7B   | 🔴GEMV   | 8          | 2048           | 2048          | 884.22           | 411.893         | 16.82 GB (71.12%) |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| TinyLlama  | 1B   | 🟢GEMV   | 1          | 64             | 64            | 1088.63          | 548.993         | 0.86 GB (3.62%)   |
| TinyLlama  | 1B   | 🟢GEMV   | 1          | 2048           | 2048          | 5178.98          | 431.468         | 2.10 GB (8.89%)   |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| Llama 2    | 13B  | 🔵GEMM   | 1          | 64             | 64            | 820.34           | 96.74           | 8.47 GB (35.83%)  |
| Llama 2    | 13B  | 🔵GEMM   | 1          | 2048           | 2048          | 2279.41          | 73.8213         | 10.28 GB (43.46%) |
| Llama 2    | 13B  | 🔵GEMM   | 3          | 64             | 64            | 1593.88          | 286.249         | 8.57 GB (36.24%)  |
| Llama 2    | 13B  | 🔵GEMM   | 3          | 2048           | 2048          | 2226.7           | 189.573         | 16.90 GB (71.47%) |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| MPT        | 7B   | 🔵GEMM   | 1          | 64             | 64            | 1079.06          | 161.344         | 3.67 GB (15.51%)  |
| MPT        | 7B   | 🔵GEMM   | 1          | 2048           | 2048          | 4069.78          | 114.982         | 5.87 GB (24.82%)  |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| Falcon     | 7B   | 🔵GEMM   | 1          | 64             | 64            | 1139.93          | 133.585         | 4.47 GB (18.92%)  |
| Falcon     | 7B   | 🔵GEMM   | 1          | 2048           | 2048          | 2850.97          | 115.73          | 6.83 GB (28.88%)  |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| CodeLlama  | 34B  | 🔵GEMM   | 1          | 64             | 64            | 681.74           | 41.01           | 19.05 GB (80.57%) |
| CodeLlama  | 34B  | 🔵GEMM   | 1          | 2048           | 2048          | 1072.36          | 35.8316         | 20.26 GB (85.68%) |
| ...        | ...  | ...     | ...        | ...            | ...           | ...              | ...             | ...               |
| DeepSeek   | 33B  | 🔵GEMM   | 1          | 64             | 64            | 1160.18          | 40.29           | 18.92 GB (80.00%) |
| DeepSeek   | 33B  | 🔵GEMM   | 1          | 2048           | 2048          | 1012.1           | 34.0093         | 19.87 GB (84.02%) |

### Multi-GPU

GPU: 2x NVIDIA GeForce RTX 4090

| Model | Size    | Version       |   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|--------:|------:|--------------:|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
| Mixtral | 46.7B | 🔵GEMM        |            1 |               32 |              32 |            149.742 |           93.406  | 25.28 GB (53.44%) |
| Mixtral | 46.7B | 🔵GEMM        |            1 |               64 |              64 |           1489.64  |           93.184  | 25.32 GB (53.53%) |
| Mixtral | 46.7B | 🔵GEMM        |            1 |              128 |             128 |           2082.95  |           92.9444 | 25.33 GB (53.55%) |
| Mixtral | 46.7B | 🔵GEMM        |            1 |              256 |             256 |           2428.59  |           91.5187 | 25.35 GB (53.59%) |
| Mixtral | 46.7B | 🔵GEMM        |            1 |              512 |             512 |           2633.11  |           89.1457 | 25.39 GB (53.67%) |
| Mixtral | 46.7B | 🔵GEMM        |            1 |             1024 |            1024 |           2598.95  |           84.6753 | 25.75 GB (54.44%) |
| Mixtral | 46.7B | 🔵GEMM        |            1 |             2048 |            2048 |           2446.15  |           77.0516 | 27.98 GB (59.15%) |
| Mixtral | 46.7B | 🔵GEMM        |            1 |             4096 |            4096 |           1985.78  |           77.5689 | 34.65 GB (73.26%) |
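
To reproduce a setup like this, the quantized weights have to be sharded across both GPUs at load time. A minimal sketch (the checkpoint name is an assumption; whether `from_quantized` accepts `device_map` depends on your AutoAWQ version, so treat it as such and fall back to the transformers loading path shown earlier if it does not):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Assumed checkpoint name for an AWQ-quantized Mixtral model.
quant_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"

# device_map="auto" is assumed to be forwarded to 🤗 accelerate so that the
# weights are split across both RTX 4090s.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)
```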

## Reference

If you find AWQ useful or relevant to your research, you can cite their [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```