# AutoAWQ

<p align="center">
| <a href="https://github.com/casper-hansen/AutoAWQ/issues/32"><b>Roadmap</b></a> | <a href="https://github.com/casper-hansen/AutoAWQ/tree/main/examples"><b>Examples</b></a> | <a href="https://github.com/casper-hansen/AutoAWQ/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22"><b>Issues: Help Wanted</b></a> |

</p>
<p align="center">
    <a href="https://huggingface.co/models?search=awq">
        <img alt="Huggingface - Models" src="https://img.shields.io/badge/🤗_600+_models_available-8A2BE2">
    </a>
    <a href="https://github.com/casper-hansen/AutoAWQ/releases">
        <img alt="GitHub - Releases" src="https://img.shields.io/github/release/casper-hansen/AutoAWQ.svg">
    </a>
    <a href="https://pypi.org/project/autoawq/">
        <img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/autoawq/month">
    </a>
</p>

AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created as an improved version of the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.

*Latest News* 🔥
- [2023/11] AutoAWQ has been merged into 🤗 transformers. Example found in: [examples/basic_transformers](examples/basic_transformers.py).
- [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
- [2023/08] PyPI package released and AutoModel class available

## Install

Requirements: 
- Compute Capability 7.5 (sm75). Turing and later architectures are supported.
- CUDA Toolkit 11.8 and later.
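
If you are unsure whether your GPU qualifies, a quick check with PyTorch (assuming a CUDA build of `torch` is already installed) looks like this:

```python
import torch

# Compute capability as a (major, minor) tuple, e.g. (7, 5) for Turing.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor} (needs to be >= 7.5)")

# CUDA version this PyTorch build was compiled against, e.g. "11.8".
print(f"CUDA version (PyTorch build): {torch.version.cuda}")
```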

---

Install:
- Use pip to install AutoAWQ:

```
pip install autoawq
```

### Using conda

CUDA dependencies can be hard to manage sometimes. It is recommended to use conda with AutoAWQ:

```
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
```

### Build from source

<details>

<summary>Build AutoAWQ from scratch</summary>

Build time can take 10 minutes. Download your model while you install AutoAWQ.

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

</details>

## Supported models

The detailed support list:

| Models   | Sizes                       |
| ---------| ----------------------------|
| LLaMA-2  | 7B/13B/70B                  |
| LLaMA    | 7B/13B/30B/65B              |
| Mistral  | 7B                          |
| Vicuna   | 7B/13B                      |
| MPT      | 7B/30B                      |
| Falcon   | 7B/40B                      |
| OPT      | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom    | 560m/3B/7B                  |
| GPTJ     | 6.7B                        |
| Aquila   | 7B                          |
| Aquila2  | 7B/34B                      |

## Usage

Under `examples`, you can find scripts that show how to quantize, run inference, and benchmark AutoAWQ models.

### INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following:

- GEMV (quantized): 20% faster than GEMM for small batch sizes (max batch size 4 / small context).
- GEMM (quantized): Much faster than FP16 at batch sizes below 8 (good with large contexts).
- FP16 (non-quantized): Recommended for highest throughput: [vLLM](https://github.com/vllm-project/vllm).
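
The kernel version is chosen at quantization time via the `version` field of `quant_config` (see the full quantization example further down). As a minimal sketch, producing a GEMV model instead of a GEMM model only changes that field; the model and output paths below are just examples:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'

# "version" selects the kernel: "GEMM" for larger batches/contexts, "GEMV" for small ones.
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV" }

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized('vicuna-7b-v1.5-awq-gemv')
tokenizer.save_pretrained('vicuna-7b-v1.5-awq-gemv')
```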

#### Compute-bound vs Memory-bound

At small batch sizes with small 7B models, we are memory-bound. This means we are bound by the bandwidth our GPU has to push around the weights in memory, and this is essentially what limits how many tokens per second we can generate. Being memory-bound is what makes quantized models faster because your weights are 3x smaller and can therefore be pushed around in memory much faster. This is different from being compute-bound where the main time spent during generation is doing matrix multiplication. 

In the scenario of being compute-bound, which happens at higher batch sizes, you will not gain a speed-up using a W4A16 quantized model because the overhead of dequantization will slow down the overall generation. This happens because AWQ quantized models only store the weights in INT4 but perform FP16 operations during inference, so we are essentially converting INT4 -> FP16 during inference.
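
As a rough illustration of the memory-bound regime, you can estimate an upper bound on decode speed by dividing memory bandwidth by the bytes of weights read per generated token. The numbers below are assumptions for illustration (an RTX 3090 at ~936 GB/s and a 7B model), not measurements:

```python
# Back-of-the-envelope decode limit: tokens/s <= bandwidth / bytes of weights read per token.
bandwidth_gb_s = 936                    # assumed RTX 3090 memory bandwidth
fp16_weights_gb = 7 * 2.0               # 7B parameters at 2 bytes each
awq_weights_gb = fp16_weights_gb / 3    # ~3x smaller after 4-bit quantization

print(f"FP16 upper bound: ~{bandwidth_gb_s / fp16_weights_gb:.0f} tokens/s")
print(f"AWQ  upper bound: ~{bandwidth_gb_s / awq_weights_gb:.0f} tokens/s")
```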

### Fused modules

Fused modules are a large part of the speedup you get from AutoAWQ. The idea is to combine multiple layers into a single operation, making them more efficient. Fused modules are a set of custom modules that work separately from Huggingface models. They are compatible with `model.generate()` and other Huggingface methods, but activating fused modules places some constraints on how you can use your model (see the sketch after this list):

- Fused modules are activated when you use `fuse_layers=True`.
- A custom cache is implemented. It preallocates based on batch size and sequence length.
    - You cannot change the sequence length or batch size after you have created your model.
    - Reference: `AutoAWQForCausalLM.from_quantized(max_new_tokens=seq_len, batch_size=batch_size)`
- The main accelerator in the fused modules comes from FasterTransformer, which is only compatible with Linux.
- The `past_key_values` from `model.generate()` are only dummy values, so they cannot be used after generation.
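
A minimal sketch of activating fused modules with a preallocated cache might look like this (the model path and sizes are only examples):

```python
from awq import AutoAWQForCausalLM

# With fused modules, the cache is preallocated, so sequence length and
# batch size are fixed at load time and cannot be changed afterwards.
model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq",
    fuse_layers=True,      # activate fused modules
    max_new_tokens=2048,   # maximum sequence length the cache is allocated for
    batch_size=1,          # batch size the cache is allocated for
)
```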

### Examples

More examples can be found in the [examples directory](examples).

<details>

<summary>Quantization</summary>

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

</details>

<details>

<summary>Inference</summary>

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)
```

</details>

<details>

<summary>AutoAWQForCausalLM.from_quantized</summary>

- `quant_path`: Path to folder containing model files.
- `quant_filename`: The filename of the model weights or the `index.json` file.
- `max_new_tokens`: The max sequence length, used to allocate kv-cache for fused models.
- `fuse_layers`: Whether or not to use fused layers.
- `batch_size`: The batch size to initialize the AWQ model with.

</details>

## Benchmarks

- GPU: RTX 3090
- Command: `python examples/benchmark.py --model_path <hf_model>`

| Model Name    | Version | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|---------------|---------|------------|----------------|---------------|------------------|-----------------|------------------|
| Vicuna 7B     | GEMM    | 1          | 64             | 64            | 2618.88          | 125.428         | 4.57 GB (19.31%) |
| Vicuna 7B     | GEMM    | 1          | 128            | 128           | 2808.09          | 123.865         | 4.61 GB (19.44%) |
| ...           | ...     | ...        | ...            | ...           | ...              | ...             | ...              |
| Vicuna 7B     | GEMV    | 1          | 64             | 64            | 233.909          | 154.475         | 4.66 GB (19.68%) |
| Vicuna 7B     | GEMV    | 1          | 128            | 128           | 233.145          | 152.133         | 4.66 GB (19.68%) |
| ...           | ...     | ...        | ...            | ...           | ...              | ...             | ...              |
| MPT 7B        | GEMM    | 1          | 64             | 64            | 2752.9           | 120.772         | 3.67 GB (15.48%) |
| MPT 7B        | GEMM    | 1          | 128            | 128           | 2982.67          | 119.52          | 3.70 GB (15.61%) |
| ...           | ...     | ...        | ...            | ...           | ...              | ...             | ...              |
| MPT 7B        | GEMV    | 1          | 64             | 64            | 241.026          | 136.476         | 3.67 GB (15.48%) |
| MPT 7B        | GEMV    | 1          | 128            | 128           | 239.44           | 137.599         | 3.70 GB (15.61%) |
| ...           | ...     | ...        | ...            | ...           | ...              | ...             | ...              |
| Falcon 7B     | GEMM    | 1          | 64             | 64            | 1920.61          | 94.5963         | 4.48 GB (18.92%) |
| Falcon 7B     | GEMM    | 1          | 128            | 128           | 2406.1           | 94.793          | 4.48 GB (18.92%) |
| ...           | ...     | ...        | ...            | ...           | ...              | ...             | ...              |
| Aquila2 34B   | GEMM    | 1          | 64             | 64            | 516.544          | 23.3536         | 18.26 GB (46.12%)|
| Aquila2 34B   | GEMM    | 1          | 128            | 128           | 643.968          | 23.3803         | 18.26 GB (46.12%)|
| ...           | ...     | ...        | ...            | ...           | ...              | ...             | ...              |

## Reference

If you find AWQ useful or relevant to your research, you can cite their [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```