# AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ speeds up your LLM by roughly 1.4-2.6x compared to FP16 (see the benchmarks below). It builds on and improves upon the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.

Roadmap:

- [x] Publish pip package
- [ ] Refactor quantization code
- [ ] Support more models
- [ ] Optimize the speed of models

## Install

Requirements: 
- Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
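
If you are unsure whether your GPU qualifies, you can check its compute capability with PyTorch (a quick check; it assumes `torch` is already installed and a CUDA device is visible):

```python
import torch

# AutoAWQ's kernels need compute capability >= 8.0 (Ampere or newer)
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
```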

Install:
- Use pip to install awq

```
pip install awq
```
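
After installing, you can sanity-check that the package imports (a quick check; it assumes `python` points at the environment you installed into):

```
python -c "import awq; print('awq imported successfully')"
```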

### Build from source

<details>

<summary>Build AutoAWQ from scratch</summary>

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

</details>

## Supported models

The detailed support list:

| Models   | Sizes                       |
| ---------| ----------------------------|
| LLaMA-2  | 7B/13B/70B                  |
| LLaMA    | 7B/13B/30B/65B              |
| Vicuna   | 7B/13B                      |
| MPT      | 7B/30B                      |
| Falcon   | 7B/40B                      |
| OPT      | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom    | 560m/3B/7B                  |
| LLaVA-v0 | 13B                         |
| GPTJ     | 6.7B                        |

## Usage

Below are examples of how to quantize a model and run inference.

### Quantization

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
# 4-bit weights, quantization group size of 128, zero-point (asymmetric) quantization
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

### Inference

Run inference on a quantized model from Hugging Face:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
```
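
The `generate(...)` call above is left open. A minimal end-to-end sketch could look like the following; it assumes the AutoAWQ wrapper forwards standard `transformers` generation arguments and that the quantized model was loaded onto the GPU, and the prompt and `max_new_tokens` value are only illustrative:

```python
# Illustrative sketch: assumes the wrapper forwards standard transformers
# generate() arguments and the quantized model is on the GPU.
prompt = "What is the capital of France?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```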

## Benchmarks

Benchmark speeds may vary from server to server and also depend on your CPU. If you want to minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and high single-core CPU speed.

| Model       | GPU   | FP16 latency (ms) | INT4 latency (ms) | Speedup |
| ----------- |:-----:|:-----------------:|:-----------------:|:-------:|
| LLaMA-2-7B  | 4090  | 19.97             | 8.66              | 2.31x   |
| LLaMA-2-13B | 4090  | OOM               | 13.54             | --      |
| Vicuna-7B   | 4090  | 19.09             | 8.61              | 2.22x   |
| Vicuna-13B  | 4090  | OOM               | 12.17             | --      |
| MPT-7B      | 4090  | 17.09             | 12.58             | 1.36x   |
| MPT-30B     | 4090  | OOM               | 23.54             | --      |
| Falcon-7B   | 4090  | 29.91             | 19.84             | 1.51x   |
| LLaMA-2-7B  | A6000 | 27.14             | 12.44             | 2.18x   |
| LLaMA-2-13B | A6000 | 47.28             | 20.28             | 2.33x   |
| Vicuna-7B   | A6000 | 26.06             | 12.43             | 2.10x   |
| Vicuna-13B  | A6000 | 44.91             | 17.30             | 2.60x   |
| MPT-7B      | A6000 | 22.79             | 16.87             | 1.35x   |
| MPT-30B     | A6000 | OOM               | 31.57             | --      |
| Falcon-7B   | A6000 | 39.44             | 27.34             | 1.44x   |

<details>

<summary>Detailed benchmark (CPU vs. GPU)</summary>

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

</details>
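
To get a rough per-token latency number on your own hardware, a simple timing loop like the sketch below is enough. It is not the exact script used for the tables above; the prompt, warm-up, and token counts are illustrative, and it assumes `model` and `tokenizer` are loaded as in the inference example:

```python
import time
import torch

# Rough per-token decode latency (illustrative; not the official benchmark script).
prompt = "The quick brown fox"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

_ = model.generate(input_ids, max_new_tokens=16)  # warm-up

torch.cuda.synchronize()
start = time.time()
output = model.generate(input_ids, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

n_generated = output.shape[1] - input_ids.shape[1]
print(f"{n_generated / elapsed:.1f} tokens/s ({1000 * elapsed / n_generated:.2f} ms/token)")
```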

## Reference

If you find AWQ useful or relevant to your research, you can cite their [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```