# AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ speeds up inference by roughly 1.4-2.6x compared to FP16 (see the benchmarks below). AutoAWQ builds on and improves the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.

Roadmap:

- [ ] Publish pip package
- [ ] Refactor quantization code
- [ ] Support more models
- [ ] Optimize the speed of models

## Install

Clone this repository and install with pip.

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

## Supported models

The detailed support list:

| Models   | Sizes                       |
| ---------| ----------------------------|
| LLaMA-2  | 7B/13B/70B                  |
| LLaMA    | 7B/13B/30B/65B              |
| Vicuna   | 7B/13B                      |
| MPT      | 7B/30B                      |
| Falcon   | 7B/40B                      |
| OPT      | 125M/1.3B/2.7B/6.7B/13B/30B |
| Bloom    | 560M/3B/7B                  |
| LLaVA-v0 | 13B                         |

## Usage

Below, you will find examples of how to easily quantize a model and run inference.

### Quantization

```python
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
# 4-bit weights, quantization groups of 128 elements, asymmetric (zero-point) quantization
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

### Inference

Run inference on a quantized model from the Hugging Face Hub:

```python
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load quantized model and its tokenizer
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
```
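
The `...` above is a placeholder for your own generation call. As a minimal sketch, assuming the quantized model sits on the GPU and forwards the standard Hugging Face `generate()` interface (the prompt and generation parameters below are illustrative, not part of the AutoAWQ API):

```python
# Illustrative sketch: prompt and generation parameters are assumptions.
prompt = "What is activation-aware weight quantization?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

output = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```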

## Benchmarks

Benchmark speeds may vary from server to server, and they also depend on your CPU. If you want to minimize latency, rent a GPU/CPU combination with high memory bandwidth for both and a high single-core speed for the CPU.

| Model       | GPU   | FP16 latency (ms) | INT4 latency (ms) | Speedup |
| ----------- |:-----:|:-----------------:|:-----------------:|:-------:|
| LLaMA-2-7B  | 4090  | 19.97             | 8.66              | 2.31x   |
| LLaMA-2-13B | 4090  | OOM               | 13.54             | --      |
| Vicuna-7B   | 4090  | 19.09             | 8.61              | 2.22x   |
| Vicuna-13B  | 4090  | OOM               | 12.17             | --      |
| MPT-7B      | 4090  | 17.09             | 12.58             | 1.36x   |
| MPT-30B     | 4090  | OOM               | 23.54             | --      |
| Falcon-7B   | 4090  | 29.91             | 19.84             | 1.51x   |
| LLaMA-2-7B  | A6000 | 27.14             | 12.44             | 2.18x   |
| LLaMA-2-13B | A6000 | 47.28             | 20.28             | 2.33x   |
| Vicuna-7B   | A6000 | 26.06             | 12.43             | 2.10x   |
| Vicuna-13B  | A6000 | 44.91             | 17.30             | 2.60x   |
| MPT-7B      | A6000 | 22.79             | 16.87             | 1.35x   |
| MPT-30B     | A6000 | OOM               | 31.57             | --      |
| Falcon-7B   | A6000 | 39.44             | 27.34             | 1.44x   |


For example, here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
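
If you want to measure numbers like these on your own setup, the following is a rough sketch; it assumes a CUDA device and the `model` and `tokenizer` objects from the inference example above, and the prompt and token count are arbitrary:

```python
import time

import torch

# Rough throughput measurement: force a fixed number of new tokens so
# runs are comparable, then report tokens/s and ms/token.
n_new_tokens = 256
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids.cuda()

torch.cuda.synchronize()
start = time.time()
output = model.generate(
    input_ids=input_ids,
    min_new_tokens=n_new_tokens,
    max_new_tokens=n_new_tokens,
)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = output.shape[1] - input_ids.shape[1]
print(f"{generated / elapsed:.0f} tokens/s ({1000 * elapsed / generated:.2f} ms/token)")
```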

## Reference

If you find AWQ useful or relevant to your research, you can cite the original [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```