{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AWQ on Vicuna" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we use Vicuna model to demonstrate the performance of AWQ on instruction-tuned models. We implement AWQ real-INT4 inference kernels, which are wrapped as Pytorch modules and can be easily used by existing models. We also provide a simple example to show how to use AWQ to quantize a model and save/load the quantized model checkpoint." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to run this notebook, you need to install the following packages:\n", "- [AWQ](https://github.com/mit-han-lab/llm-awq)\n", "- [Pytorch](https://pytorch.org/)\n", "- [Accelerate](https://github.com/huggingface/accelerate)\n", "- [Transformers](https://github.com/huggingface/transformers)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from accelerate import init_empty_weights, load_checkpoint_and_dispatch\n", "from awq.quantize.quantizer import real_quantize_model_weight\n", "from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig\n", "from tinychat.demo import gen_params, stream_output\n", "from tinychat.stream_generators import StreamGenerator\n", "from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp\n", "from tinychat.utils.prompt_templates import get_prompter\n", "import os\n", "# This demo only support single GPU for now\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please get the Vicuna model from [FastChat](https://github.com/lm-sys/FastChat) and run the following command to generate a quantized model checkpoint first.\n", "\n", "```bash\n", "mkdir quant_cache\n", "python -m awq.entry --model_path [vicuna-7b_model_path] \\\n", " --w_bit 4 --q_group_size 128 \\\n", " --load_awq awq_cache/vicuna-7b-w4-g128.pt \\\n", " --q_backend real --dump_quant quant_cache/vicuna-7b-w4-g128-awq.pt\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# model_path = \"\" # the path of vicuna-7b model\n", "# load_quant_path = \"quant_cache/vicuna-7b-w4-g128-awq.pt\"\n", "model_path = \"/data/llm/checkpoints/vicuna-hf/vicuna-7b\"\n", "load_quant_path = \"/data/llm/checkpoints/vicuna-hf/vicuna-7b-awq-w4g128.pt\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first load a empty model and replace all the linear layers with WQLinear layers. Then we load the quantized weights from the checkpoint. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8b79a82b73ab4d9191ba54f5d0f8cb86", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00