# AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. Compared to FP16, AutoAWQ speeds up inference by roughly 1.4-2.6x, depending on the model and GPU (see the benchmarks below). AutoAWQ is a packaged and extended version of the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.

Roadmap:

- [x] Publish pip package
- [ ] Refactor quantization code
- [ ] Support more models
- [ ] Optimize the speed of models

## Install

Requirements:
- Compute Capability 8.0 (sm80) or higher; Ampere and later architectures are supported (a quick check is sketched below).
- CUDA Toolkit 11.8 and later.

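You can verify both requirements from Python (a minimal check; assumes PyTorch is already installed):

```python
import torch

# Ampere (sm80) and newer GPUs report a compute capability of (8, 0) or higher.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor} (need >= 8.0)")

# CUDA version that this PyTorch build was compiled against.
print(f"CUDA version: {torch.version.cuda} (need >= 11.8)")
```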
---

Install:
- Use pip to install autoawq:

```
pip install autoawq
```

### Using conda

CUDA dependencies can sometimes be hard to manage. It is recommended to use conda with AutoAWQ:

```
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
```

### Build from source

<details>

<summary>Build AutoAWQ from scratch</summary>

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

</details>

## Supported models

The detailed support list:

| Models   | Sizes                       |
| ---------| ----------------------------|
| LLaMA-2  | 7B/13B/70B                  |
| LLaMA    | 7B/13B/30B/65B              |
| Vicuna   | 7B/13B                      |
| MPT      | 7B/30B                      |
| Falcon   | 7B/40B                      |
| OPT      | 125M/1.3B/2.7B/6.7B/13B/30B |
| Bloom    | 560M/3B/7B                  |
| LLaVA-v0 | 13B                         |
| GPTJ     | 6.7B                        |

## Usage

Below, you will find examples of how to quantize a model and run inference.

### Quantization

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
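For intuition, the `quant_config` above describes group-wise, zero-point (asymmetric) INT4 quantization. The sketch below illustrates that arithmetic on a random tensor; it is only an illustration of the scheme, not AutoAWQ's internal implementation (which additionally applies activation-aware scaling):

```python
import torch

# Illustration of w_bit=4, q_group_size=128, zero_point=True on a random weight matrix.
w = torch.randn(4096, 4096)
w_bit, group_size = 4, 128
max_int = 2**w_bit - 1

groups = w.reshape(-1, group_size)                      # quantize every 128 weights together
scales = (groups.amax(1, keepdim=True) - groups.amin(1, keepdim=True)).clamp(min=1e-5) / max_int
zeros = (-torch.round(groups.amin(1, keepdim=True) / scales)).clamp(0, max_int)
q = torch.clamp(torch.round(groups / scales) + zeros, 0, max_int)   # 4-bit integer codes
w_dequant = ((q - zeros) * scales).reshape(w.shape)     # what gets reconstructed at runtime

print("mean abs error:", (w - w_dequant).abs().mean().item())
```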

### Inference

Run inference on a quantized model from the Hugging Face Hub:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
```
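A fuller generation call might look like the sketch below. It assumes `model.generate` forwards to the standard Hugging Face `generate` API and that the quantized model lives on the GPU; the prompt and generation parameters are illustrative only:

```python
# Illustrative only: prompt and parameters are placeholders, not part of AutoAWQ.
prompt = "What is AWQ quantization?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```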

## Benchmarks

Benchmark speeds may vary from server to server, and they also depend on your CPU. If you want to minimize latency, rent a GPU/CPU combination with high memory bandwidth for both and a high single-core CPU speed.

| Model       | GPU   | FP16 latency (ms) | INT4 latency (ms) | Speedup |
| ----------- |:-----:|:-----------------:|:-----------------:|:-------:|
| LLaMA-2-7B  | 4090  | 19.97             | 8.66              | 2.31x   |
| LLaMA-2-13B | 4090  | OOM               | 13.54             | --      |
| Vicuna-7B   | 4090  | 19.09             | 8.61              | 2.22x   |
| Vicuna-13B  | 4090  | OOM               | 12.17             | --      |
| MPT-7B      | 4090  | 17.09             | 12.58             | 1.36x   |
| MPT-30B     | 4090  | OOM               | 23.54             | --      |
| Falcon-7B   | 4090  | 29.91             | 19.84             | 1.51x   |
| LLaMA-2-7B  | A6000 | 27.14             | 12.44             | 2.18x   |
| LLaMA-2-13B | A6000 | 47.28             | 20.28             | 2.33x   |
| Vicuna-7B   | A6000 | 26.06             | 12.43             | 2.10x   |
| Vicuna-13B  | A6000 | 44.91             | 17.30             | 2.60x   |
| MPT-7B      | A6000 | 22.79             | 16.87             | 1.35x   |
| MPT-30B     | A6000 | OOM               | 31.57             | --      |
| Falcon-7B   | A6000 | 39.44             | 27.34             | 1.44x   |

<details>

<summary>Detailed benchmark (CPU vs. GPU)</summary>

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

</details>

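If you want to reproduce numbers like these on your own hardware, a rough way to measure decode speed is sketched below (illustrative only; it assumes `model` and `tokenizer` are loaded as in the inference example above and that the model is on the GPU):

```python
import time
import torch

# Rough per-token decode latency measurement; prompt and token count are placeholders.
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids.cuda()
n_new = 128

torch.cuda.synchronize()
start = time.time()
output = model.generate(input_ids, max_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = output.shape[1] - input_ids.shape[1]
print(f"{generated / elapsed:.1f} tokens/s ({1000 * elapsed / generated:.2f} ms/token)")
```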
## Reference

If you find AWQ useful or relevant to your research, please cite the [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```