# AutoAWQ

<p align="center">
| <a href="https://github.com/casper-hansen/AutoAWQ/issues/32"><b>Roadmap</b></a> | <a href="https://github.com/casper-hansen/AutoAWQ/tree/main/examples"><b>Examples</b></a> | <a href="https://github.com/casper-hansen/AutoAWQ/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22"><b>Issues: Help Wanted</b></a> |
</p>

AutoAWQ is an easy-to-use package for 4-bit quantized models. It speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, building on and improving the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.
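
As a rough, weights-only back-of-the-envelope illustration of where the memory savings come from (actual usage also includes activations, the KV cache, and per-group quantization metadata):

```python
# Illustrative estimate only; the 7B parameter count is an example, not a measurement.
params = 7e9                   # e.g. a 7B-parameter model
fp16_gb = params * 2 / 1e9     # FP16 stores 2 bytes per weight
int4_gb = params * 0.5 / 1e9   # INT4 stores 4 bits (0.5 bytes) per weight
print(f"FP16 weights: ~{fp16_gb:.0f} GB, INT4 weights: ~{int4_gb:.1f} GB")
```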

*Latest News* 🔥
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
- [2023/08] PyPI package released and AutoModel class available

## Install

Requirements: 
- GPU with Compute Capability 8.0 (sm80) or higher; Ampere and later architectures are supported.
- CUDA Toolkit 11.8 or later (a quick way to check both is shown below).
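
If you are unsure whether your setup meets these requirements, here is a minimal check sketch (assuming PyTorch with CUDA support is already installed):

```python
import torch

# Compute capability of the first visible GPU; 8.0 (sm80) or higher is required.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# CUDA version that PyTorch was built against; 11.8 or later is required.
print(f"CUDA version: {torch.version.cuda}")
```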

---

Install:
- Use pip to install AutoAWQ:

```
pip install autoawq
```

### Using conda

CUDA dependencies can sometimes be hard to manage. It is recommended to use conda with AutoAWQ:

```
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
```

### Build from source

<details>

<summary>Build AutoAWQ from scratch</summary>

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

</details>

## Supported models

The detailed support list:

| Models   | Sizes                       |
| ---------| ----------------------------|
| LLaMA-2  | 7B/13B/70B                  |
| LLaMA    | 7B/13B/30B/65B              |
| Vicuna   | 7B/13B                      |
| MPT      | 7B/30B                      |
| Falcon   | 7B/40B                      |
| OPT      | 125M/1.3B/2.7B/6.7B/13B/30B |
| Bloom    | 560M/3B/7B                  |
| GPTJ     | 6.7B                        |

## Usage

Below, you will find examples of how to easily quantize a model and run inference.

### Quantization

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
# 4-bit weights (w_bit), quantization group size 128 (q_group_size), asymmetric quantization with zero points
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

### Inference

Run inference on a quantized model from the Hugging Face Hub:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
```
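
The `generate` call above is abbreviated. A minimal sketch of a full call might look like the following (the prompt and generation settings are illustrative, and it assumes the quantized model sits on the GPU and that `model.generate` forwards its arguments to the underlying `transformers` `generate` method):

```python
prompt = "What is AWQ quantization?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Generate a short continuation; adjust max_new_tokens and sampling options as needed.
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```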

## Benchmarks

Benchmark speeds may vary from server to server, and they also depend on your CPU. To minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and high single-core CPU speed.

| Model       | GPU   | FP16 latency (ms) | INT4 latency (ms) | Speedup |
| ----------- |:-----:|:-----------------:|:-----------------:|:-------:|
| LLaMA-2-7B  | 4090  | 19.97             | 8.66              | 2.31x   |
| LLaMA-2-13B | 4090  | OOM               | 13.54             | --      |
| Vicuna-7B   | 4090  | 19.09             | 8.61              | 2.22x   |
| Vicuna-13B  | 4090  | OOM               | 12.17             | --      |
| MPT-7B      | 4090  | 17.09             | 12.58             | 1.36x   |
| MPT-30B     | 4090  | OOM               | 23.54             | --      |
| Falcon-7B   | 4090  | 29.91             | 19.84             | 1.51x   |
| LLaMA-2-7B  | A6000 | 27.14             | 12.44             | 2.18x   |
| LLaMA-2-13B | A6000 | 47.28             | 20.28             | 2.33x   |
| Vicuna-7B   | A6000 | 26.06             | 12.43             | 2.10x   |
| Vicuna-13B  | A6000 | 44.91             | 17.30             | 2.60x   |
| MPT-7B      | A6000 | 22.79             | 16.87             | 1.35x   |
| MPT-30B     | A6000 | OOM               | 31.57             | --      |
| Falcon-7B   | A6000 | 39.44             | 27.34             | 1.44x   |
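
For reference, the Speedup column is the ratio of the two per-token latencies, and per-token latency converts directly to throughput. A small sketch using the LLaMA-2-7B / RTX 4090 row from the table above:

```python
# Latencies in ms per token, taken from the table above.
fp16_ms, int4_ms = 19.97, 8.66

print(f"Speedup: {fp16_ms / int4_ms:.2f}x")                # ~2.31x
print(f"INT4 throughput: {1000 / int4_ms:.0f} tokens/s")   # ~115 tokens/s
```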

<details>

<summary>Detailed benchmark (CPU vs. GPU)</summary>

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

</details>

## Reference

If you find AWQ useful or relevant to your research, you can cite the [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```