# AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [[Paper](https://arxiv.org/abs/2306.00978)]

**Efficient and accurate** low-bit weight quantization (INT3/4) for LLMs, supporting **instruction-tuned** models and **multi-modal** LMs.

![overview](figures/overview.png)

The current release supports: 

- AWQ search for accurate quantization (the core recipe is sketched below).
- Pre-computed AWQ model zoo for LLMs (LLaMA-1&2, OPT, Vicuna, LLaVA; load to generate quantized weights).
- Memory-efficient 4-bit Linear in PyTorch.
- Efficient CUDA kernel implementation for fast inference (support context and decoding stage).
- Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (LLaVA).
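
At a high level, AWQ observes that only a small fraction of weight channels is salient for model quality, and that saliency is indicated by the activation magnitude rather than the weight magnitude. It therefore derives per-input-channel scaling factors from activation statistics, scales those channels up before quantization (folding the inverse scale into the preceding operation), and then applies group-wise low-bit quantization. The toy sketch below is purely illustrative; it is not the repository's API, and every name in it is hypothetical.

```python
# Illustrative sketch of activation-aware, group-wise weight quantization.
# NOT the repository's implementation; all names here are hypothetical.
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Group-wise asymmetric quantize-dequantize of a (out_features, in_features) weight matrix."""
    out_features, in_features = w.shape
    w = w.reshape(out_features, in_features // group_size, group_size)
    w_max = w.amax(dim=-1, keepdim=True)
    w_min = w.amin(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)   # one FP16 scale per group
    zero = (-w_min / scale).round()                                # one zero point per group
    w_q = (torch.clamp((w / scale).round() + zero, 0, 2 ** n_bits - 1) - zero) * scale
    return w_q.reshape(out_features, in_features)

def awq_style_quantize(w: torch.Tensor, act_scale: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """act_scale: average activation magnitude per input channel, collected on calibration data."""
    s = act_scale.clamp(min=1e-5) ** alpha   # scale salient channels up before quantization
    return pseudo_quantize(w * s) / s        # fold the inverse scale back so the layer output is preserved
```

In the actual method, the scaling exponent (here `alpha`) is chosen by a grid search that minimizes the layer's output error on a small calibration set, and the 4-bit weights are packed and consumed directly by the CUDA kernels rather than dequantized as above.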

![TinyChat on RTX 4090: W4A16 is 2.3x faster than FP16](./tinychat/figures/4090_example.gif)

Check out [TinyChat](tinychat), which delivers 2.3x faster inference performance for the **LLaMA-2** chatbot on RTX 4090!

It also offers a turn-key solution for **on-device inference** of LLMs on **resource-constrained edge platforms**. With TinyChat, it is now possible to run **large** models on **small** and **low-power** devices even without an Internet connection.


## News
- [2023/07] 🔥 We released **TinyChat**, an efficient and lightweight chatbot interface based on AWQ. TinyChat enables efficient LLM inference on both cloud and edge GPUs. Llama-2-chat models are supported! Check out our implementation [here](tinychat).
- [2023/07] 🔥 We added AWQ support and pre-computed search results for Llama-2 models (7B & 13B). Check out our model zoo [here](https://huggingface.co/datasets/mit-han-lab/awq-model-zoo)!
- [2023/07] We extended the support for more LLM models including MPT, Falcon, and BLOOM. 

## Contents

- [Install](#install)
- [AWQ Model Zoo](#awq-model-zoo)
- [Examples](#examples)
- [Usage](#usage)
- [Reference](#reference)

## Install

Clone this repository and install with pip.

```bash
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
pip install -e .
```

### CPU only

If you want to skip building the CUDA kernels, set the environment variable `BUILD_CUDA_EXT=0`:

```bash
BUILD_CUDA_EXT=0 pip install -e .
```

### Edge device

For **edge devices** like Jetson Orin:

1. Manually install precompiled PyTorch binaries (>=2.0.0) from [NVIDIA](https://forums.developer.nvidia.com/t/pytorch-for-jetson/72048).
2. Set the appropriate Python version for the conda environment (e.g., `conda create -n awq python=3.8 -y` for JetPack 5).
3. Install AWQ: `TORCH_IS_PREBUILT=1 pip install -e .`

## AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

```bash
# git lfs install  # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
```

The detailed support list:

| Models   | Sizes                       | INT4-g128  | INT3-g128 |
| ---------| ----------------------------| -----------| --------- |
| LLaMA-2  | 7B/13B/70B                  | ✅         | ✅        |
| LLaMA    | 7B/13B/30B/65B              | ✅         | ✅        |
| Vicuna   | 7B/13B                      | ✅         |           |
| MPT      | 7B/30B                      | ✅         |           |
| Falcon   | 7B/40B                      | ✅         |           |
| OPT      | 125M/1.3B/2.7B/6.7B/13B/30B | ✅         | ✅        |
| BLOOM    | 560M/3B/7B                  | ✅         | ✅        |
| LLaVA-v0 | 13B                         | ✅         |           |
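
Here `g128` means the weights are quantized in groups of 128, with one FP16 scale (and a zero point) stored per group. As a rough estimate, INT4-g128 therefore costs about 4 + 16/128 ≈ 4.1 bits per weight (slightly more once zero points are counted), i.e., roughly a 4x reduction in weight memory compared to FP16.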

## Examples

AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.

Here we provide two examples of AWQ application under the `./examples` directory: Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning). AWQ reduces the GPU memory needed for model serving and speeds up token generation, while the quantization is accurate enough to preserve the models' reasoning outputs. You should be able to observe **memory savings** when running the models with 4-bit weights.
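
As a rough back-of-envelope check (weights only; actual GPU usage also includes activations, the KV cache, and packing overhead):

```python
# Approximate weight memory for a ~7B-parameter model (illustrative numbers only).
params = 7e9
fp16_gb = params * 2 / 1e9                     # 2 bytes per FP16 weight            -> ~14 GB
int4_gb = params * (4 + 16 / 128) / 8 / 1e9    # 4-bit weights + FP16 scale per 128 -> ~3.6 GB
print(f"FP16: ~{fp16_gb:.1f} GB   INT4-g128: ~{int4_gb:.1f} GB")
```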

Note that we perform AWQ using only textual calibration data, even when the quantized model is run on multi-modal inputs. Please refer to `./examples` for details.

![overview](figures/example_vis.jpg)

## Usage

We provide several sample scripts to run AWQ (please refer to `./scripts`). We use Vicuna 7B v1.5 as an example.

1. Perform AWQ search and save search results
```bash
python -m awq.entry --entry_type search \
    --model_path lmsys/vicuna-7b-v1.5 \
    --search_path vicuna-7b-v1.5-awq
```

Note: if you use Falcon-7B, please pass `--q_group_size 64`; the default group size does not work for this model.

2. Generate quantized weights and save them (INT4)
```bash
python -m awq.entry --entry_type quant \
    --model_path lmsys/vicuna-7b-v1.5 \
    --search_path vicuna-7b-v1.5-awq/awq_model_search_result.pt \
    --quant_path vicuna-7b-v1.5-awq
```

3. Load the real quantized model weights and evaluate perplexity (runs faster and uses less GPU memory)
```bash
python -m awq.entry --entry_type perplexity \
    --quant_path vicuna-7b-v1.5-awq \
    --quant_file awq_model_w4_g128.pt
```

## Reference

If you find AWQ useful or relevant to your research, please kindly cite our paper:

```bibtex
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```

## Related Projects

[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://github.com/mit-han-lab/smoothquant)

[GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://arxiv.org/abs/2210.17323)

[Vicuna and FastChat](https://github.com/lm-sys/FastChat#readme)

[LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)