# AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. Compared to FP16, AutoAWQ speeds up models by roughly 2x (see the benchmarks below). AutoAWQ was created from, and improves upon, the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.

Roadmap:

- [ ] Publish pip package
- [ ] Refactor quantization code
- [ ] Support more models
- [ ] Optimize the speed of models

## Install

Requirements:
- GPU with CUDA Compute Capability 8.0 (sm80) or higher; Ampere and later architectures are supported.

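Before installing, you can verify that your GPU meets this requirement. A minimal check with PyTorch (assuming it is already installed) might look like:

```python
import torch

# Compute capability is reported as a (major, minor) tuple; Ampere and newer report major >= 8.
major, minor = torch.cuda.get_device_capability()
print(f"sm{major}{minor}:", "supported" if major >= 8 else "not supported")
```
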
Clone this repository and install with pip.

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

## Supported models

The detailed support list:

| Models   | Sizes                       |
| ---------| ----------------------------|
| LLaMA-2  | 7B/13B/70B                  |
| LLaMA    | 7B/13B/30B/65B              |
| Vicuna   | 7B/13B                      |
| MPT      | 7B/30B                      |
| Falcon   | 7B/40B                      |
| OPT      | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom    | 560m/3B/7B                  |
| LLaVA-v0 | 13B                         |

## Usage

Below, you will find examples of how to quantize a model and run inference.

### Quantization

```python
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
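# zero_point enables asymmetric quantization with zero points, q_group_size sets
# how many weights share one scale, and w_bit is the weight bit-width.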
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

### Inference

Run inference on a quantized model from the Hugging Face Hub:

```python
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

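# Load the quantized weights and the matching tokenizer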
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
```
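
The `generate` call above takes the usual Hugging Face generation arguments. As a minimal sketch (assuming the quantized model sits on the GPU and the wrapper forwards arguments to the underlying `transformers` `generate` method), a full call might look like:

```python
prompt = "What is activation-aware weight quantization?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Generate up to 64 new tokens; any transformers generation argument can be passed.
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```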

## Benchmarks

Benchmark speeds may vary from server to server, and they also depend on your CPU. If you want to minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and a high single-core CPU speed.

| Model       | GPU   | FP16 latency (ms) | INT4 latency (ms) | Speedup |
| ----------- |:-----:|:-----------------:|:-----------------:|:-------:|
| LLaMA-2-7B  | 4090  | 19.97             | 8.66              | 2.31x   |
| LLaMA-2-13B | 4090  | OOM               | 13.54             | --      |
| Vicuna-7B   | 4090  | 19.09             | 8.61              | 2.22x   |
| Vicuna-13B  | 4090  | OOM               | 12.17             | --      |
| MPT-7B      | 4090  | 17.09             | 12.58             | 1.36x   |
| MPT-30B     | 4090  | OOM               | 23.54             | --      |
| Falcon-7B   | 4090  | 29.91             | 19.84             | 1.51x   |
| LLaMA-2-7B  | A6000 | 27.14             | 12.44             | 2.18x   |
| LLaMA-2-13B | A6000 | 47.28             | 20.28             | 2.33x   |
| Vicuna-7B   | A6000 | 26.06             | 12.43             | 2.10x   |
| Vicuna-13B  | A6000 | 44.91             | 17.30             | 2.60x   |
| MPT-7B      | A6000 | 22.79             | 16.87             | 1.35x   |
| MPT-30B     | A6000 | OOM               | 31.57             | --      |
| Falcon-7B   | A6000 | 39.44             | 27.34             | 1.44x   |


For example, here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
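
The ms/token figures above are simply the inverse of the throughput, as in the small sketch below:

```python
def ms_per_token(tokens_per_second: float) -> float:
    # Per-token latency is the reciprocal of throughput, converted to milliseconds.
    return 1000.0 / tokens_per_second

print(ms_per_token(134))  # ~7.46, matching the RTX 4090 + i9 13900K figure above
```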

## Reference

If you find AWQ useful or relevant to your research, please cite the [paper](https://arxiv.org/abs/2306.00978):

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```