# AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [[Paper](https://arxiv.org/abs/2306.00978)]

**Efficient and accurate** low-bit weight quantization (INT3/4) for LLMs, supporting **instruction-tuned** models and **multi-modal** LMs.

![overview](figures/overview.png)

The current release supports:

- AWQ search for accurate quantization.
- Pre-computed AWQ model zoo for LLMs (LLaMA, OPT, Vicuna, LLaVA; load to generate quantized weights).
- Memory-efficient 4-bit Linear in PyTorch.
- Efficient CUDA kernel implementation for fast inference (supports both the context and decoding stages).
- Examples of 4-bit inference with an instruction-tuned model (Vicuna) and a multi-modal LM (LLaVA).

## Contents

- [Install](#install)
- [AWQ Model Zoo](#awq-model-zoo)
- [Examples](#examples)
- [Usage](#usage)
- [Reference](#reference)

## Install

1. Clone this repository and navigate to the AWQ folder

   ```
   git clone https://github.com/mit-han-lab/llm-awq
   cd llm-awq
   ```

2. Install the package

   ```
   conda create -n awq python=3.10 -y
   conda activate awq
   pip install --upgrade pip  # enable PEP 660 support
   pip install -e .
   ```

3. Install the kernel implementation

   ```
   cd awq/kernels
   python setup.py install
   ```

## AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

```bash
# git lfs install  # install git lfs if not already installed
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
```

The detailed support list:

| Models | Sizes                       | INT4-g128 | INT3-g128 |
| ------ | --------------------------- | --------- | --------- |
| LLaMA  | 7B/13B/30B/65B              | ✅        | ✅        |
| OPT    | 125m/1.3B/2.7B/6.7B/13B/30B | ✅        | ✅        |
| Vicuna | 7B/13B                      | ✅        |           |
| LLaVA  | 13B                         | ✅        |           |

## Examples

Thanks to its good generalization, AWQ can be easily applied to various LMs, including instruction-tuned models and multi-modal LMs, and provides an easy-to-use tool to reduce the serving cost of LLMs. Here we provide two examples of AWQ application under the `./examples` directory: Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning). AWQ reduces the GPU memory required for model serving and speeds up token generation, while its accurate quantization preserves the quality of the models' outputs. You should be able to observe **memory savings** when running the models with 4-bit weights. Note that we perform AWQ using only textual calibration data, even though the model is run on multi-modal input. Please refer to `./examples` for details.

![overview](figures/example_vis.jpg)

## Usage

We provide several sample scripts to run AWQ (please refer to `./scripts`). We use OPT-6.7B as an example.

1. Perform the AWQ search and save the search results (we already did it for you):

   ```bash
   python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
       --w_bit 4 --q_group_size 128 \
       --run_awq --dump_awq awq_cache/opt-6.7b-w4-g128.pt
   ```

2. Evaluate the AWQ-quantized model on WikiText-2 (simulated pseudo quantization; a conceptual sketch of what this simulates is given after these steps):

   ```bash
   python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
       --tasks wikitext \
       --w_bit 4 --q_group_size 128 \
       --load_awq awq_cache/opt-6.7b-w4-g128.pt \
       --q_backend fake
   ```

3. Generate real quantized weights (INT4):

   ```bash
   mkdir quant_cache
   python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
       --w_bit 4 --q_group_size 128 \
       --load_awq awq_cache/opt-6.7b-w4-g128.pt \
       --q_backend real --dump_quant quant_cache/opt-6.7b-w4-g128-awq.pt
   ```

4. Load and evaluate the real quantized model (you should now see lower GPU memory usage):

   ```bash
   python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
       --tasks wikitext \
       --w_bit 4 --q_group_size 128 \
       --load_quant quant_cache/opt-6.7b-w4-g128-awq.pt
   ```
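For intuition, the sketch below shows what the group-wise weight-only quantization simulated by `--q_backend fake` boils down to, together with the activation-aware per-channel scaling that gives AWQ its name. This is a minimal, self-contained illustration in plain PyTorch, not the repository's actual implementation: the function names, the fixed `alpha` exponent, and the toy shapes are assumptions made for this example only.

```python
import torch

def pseudo_quantize_groupwise(w: torch.Tensor, n_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Round-trip (quantize then dequantize) a weight matrix with per-group
    asymmetric quantization. w is [out_features, in_features]; in_features
    must be divisible by group_size."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_g = w.reshape(out_features, in_features // group_size, group_size)
    q_max = 2 ** n_bit - 1                              # 15 for 4-bit
    w_max = w_g.amax(dim=-1, keepdim=True)
    w_min = w_g.amin(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / q_max     # one scale per group
    zero = (-w_min / scale).round()                     # one zero point per group
    w_q = (w_g / scale + zero).round().clamp(0, q_max)  # integers in [0, q_max]
    return ((w_q - zero) * scale).reshape(out_features, in_features)

def awq_style_forward(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Activation-aware variant: scale up salient input channels before
    quantizing the weights, and fold the inverse scale into the activations
    so the matmul stays mathematically equivalent."""
    s = x.abs().mean(dim=0).clamp(min=1e-5) ** alpha    # per-input-channel saliency
    w_q = pseudo_quantize_groupwise(w * s)              # protect salient channels
    return (x / s) @ w_q.t()

# Toy comparison: with a few salient activation channels, the activation-aware
# version typically yields a smaller output error than plain quantization.
torch.manual_seed(0)
x, w = torch.randn(16, 512), torch.randn(1024, 512)
x[:, :8] *= 10.0                                        # a few "salient" input channels
ref = x @ w.t()
plain_err = (ref - x @ pseudo_quantize_groupwise(w).t()).abs().mean()
aware_err = (ref - awq_style_forward(x, w)).abs().mean()
print(f"plain: {plain_err:.4f}  activation-aware: {aware_err:.4f}")
```

Scaling up salient weight channels before rounding, and dividing the activations by the same factor, leaves the product unchanged in exact arithmetic while reducing the relative quantization error on the channels that matter most.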
## Reference

If you find AWQ useful or relevant to your research, please kindly cite our paper:

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```

## Related Projects

[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://github.com/mit-han-lab/smoothquant)

[GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://arxiv.org/abs/2210.17323)

[Vicuna and FastChat](https://github.com/lm-sys/FastChat#readme)

[LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)