# AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. In the benchmarks below, AutoAWQ speeds up inference by roughly 1.3-2.6x compared to FP16, depending on the model and GPU. AutoAWQ was created from, and improves upon, the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.

Roadmap:

- [x] Publish pip package
- [ ] Refactor quantization code
- [ ] Support more models
- [ ] Optimize the speed of models

## Install

Requirements: 
- Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
- CUDA Toolkit 11.8 and later.
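
As a quick sanity check for both requirements, you can query the GPU's compute capability and the CUDA version your PyTorch build was compiled against (a minimal sketch, assuming a CUDA-enabled `torch` is already installed):

```python
import torch

# Compute capability should be (8, 0) or higher, i.e. Ampere or newer
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# CUDA version this PyTorch build was compiled with; should be 11.8 or later
print(f"CUDA (torch build): {torch.version.cuda}")
```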

Install:
- Use pip to install AutoAWQ

```
pip install autoawq
```

### Build from source

<details>

<summary>Build AutoAWQ from scratch</summary>

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

</details>

## Supported models

The detailed support list:

| Models   | Sizes                       |
| ---------| ----------------------------|
| LLaMA-2  | 7B/13B/70B                  |
| LLaMA    | 7B/13B/30B/65B              |
| Vicuna   | 7B/13B                      |
| MPT      | 7B/30B                      |
| Falcon   | 7B/40B                      |
| OPT      | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom    | 560m/3B/7B                  |
| LLaVA-v0 | 13B                         |
| GPTJ     | 6.7B                        |

## Usage

Below you will find examples of how to quantize a model and run inference.

### Quantization

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
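# quant_config: 4-bit weights (w_bit), one scale per group of 128 weights
# (q_group_size), and zero-point (asymmetric) quantization enabled (zero_point)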
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

### Inference

Run inference on a quantized model from Huggingface:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
```
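
The `...` above stands in for the usual generation arguments. A minimal end-to-end call might look like this (a sketch; it assumes the quantized model forwards to the standard `transformers` `generate()` interface and sits on a CUDA device):

```python
prompt = "What is quantization?"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Generation arguments follow the standard transformers API
output = model.generate(tokens, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```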

## Benchmarks

Benchmark speeds may vary from server to server and also depend on your CPU. If you want to minimize latency, you should rent a GPU/CPU combination that has high memory bandwidth for both and a high single-core CPU speed.

| Model       | GPU   | FP16 latency (ms) | INT4 latency (ms) | Speedup |
| ----------- |:-----:|:-----------------:|:-----------------:|:-------:|
| LLaMA-2-7B  | 4090  | 19.97             | 8.66              | 2.31x   |
| LLaMA-2-13B | 4090  | OOM               | 13.54             | --      |
| Vicuna-7B   | 4090  | 19.09             | 8.61              | 2.22x   |
| Vicuna-13B  | 4090  | OOM               | 12.17             | --      |
| MPT-7B      | 4090  | 17.09             | 12.58             | 1.36x   |
| MPT-30B     | 4090  | OOM               | 23.54             | --      |
| Falcon-7B   | 4090  | 29.91             | 19.84             | 1.51x   |
| LLaMA-2-7B  | A6000 | 27.14             | 12.44             | 2.18x   |
| LLaMA-2-13B | A6000 | 47.28             | 20.28             | 2.33x   |
| Vicuna-7B   | A6000 | 26.06             | 12.43             | 2.10x   |
| Vicuna-13B  | A6000 | 44.91             | 17.30             | 2.60x   |
| MPT-7B      | A6000 | 22.79             | 16.87             | 1.35x   |
| MPT-30B     | A6000 | OOM               | 31.57             | --      |
| Falcon-7B   | A6000 | 39.44             | 27.34             | 1.44x   |
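
For reference, a rough way to measure per-token decode latency on your own hardware (a sketch, not the benchmark script used for this table; it assumes `model` and `tokenizer` are loaded as in the inference example above):

```python
import time
import torch

def ms_per_token(model, tokenizer, prompt="The quick brown fox", n_tokens=64):
    """Rough per-token decode latency in milliseconds."""
    inputs = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    torch.cuda.synchronize()
    start = time.perf_counter()
    # Force exactly n_tokens new tokens so runs are comparable
    model.generate(inputs, min_new_tokens=n_tokens, max_new_tokens=n_tokens)
    torch.cuda.synchronize()
    return 1000 * (time.perf_counter() - start) / n_tokens
```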

<details>

<summary>Detailed benchmark (CPU vs. GPU)</summary>

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

</details>

## Reference

If you find AWQ useful or relevant to your research, you can cite their [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```