{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AWQ on Vicuna" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we use Vicuna model to demonstrate the performance of AWQ on instruction-tuned models. We implement AWQ real-INT4 inference kernels, which are wrapped as Pytorch modules and can be easily used by existing models. We also provide a simple example to show how to use AWQ to quantize a model and save/load the quantized model checkpoint." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to run this notebook, you need to install the following packages:\n", "- [AWQ](https://github.com/mit-han-lab/llm-awq)\n", "- [Pytorch](https://pytorch.org/)\n", "- [Accelerate](https://github.com/huggingface/accelerate)\n", "- [Transformers](https://github.com/huggingface/transformers)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from awq.models.auto import AutoAWQForCausalLM\n", "from transformers import AutoTokenizer\n", "from tinychat.demo import gen_params, stream_output\n", "from tinychat.stream_generators import StreamGenerator\n", "from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp\n", "from tinychat.utils.prompt_templates import get_prompter\n", "import os\n", "# This demo only support single GPU for now\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Replacing layers...: 100%|██████████| 32/32 [00:02<00:00, 11.85it/s]\n" ] } ], "source": [ "model_path = 'vicuna-7b-v1.5-awq'\n", "quant_file = 'awq_model_w4_g128.pt'\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)\n", "model = AutoAWQForCausalLM.from_quantized(model_path, quant_file)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LlamaForCausalLM(\n", " (model): LlamaModel(\n", " (embed_tokens): Embedding(32000, 4096, padding_idx=0)\n", " (layers): ModuleList(\n", " (0-31): 32 x LlamaDecoderLayer(\n", " (self_attn): QuantLlamaAttention(\n", " (qkv_proj): WQLinear(in_features=4096, out_features=12288, bias=False, w_bit=4, group_size=128)\n", " (o_proj): WQLinear(in_features=4096, out_features=4096, bias=False, w_bit=4, group_size=128)\n", " (rotary_emb): QuantLlamaRotaryEmbedding()\n", " )\n", " (mlp): QuantLlamaMLP(\n", " (down_proj): WQLinear(in_features=11008, out_features=4096, bias=False, w_bit=4, group_size=128)\n", " )\n", " (input_layernorm): FTLlamaRMSNorm()\n", " (post_attention_layernorm): FTLlamaRMSNorm()\n", " )\n", " )\n", " (norm): FTLlamaRMSNorm()\n", " )\n", " (lm_head): Linear(in_features=4096, out_features=32000, bias=False)\n", ")" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "make_quant_attn(model.model, \"cuda:0\")\n", "make_quant_norm(model.model)\n", "make_fused_mlp(model.model)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ASSISTANT: Sure! Here are some popular tourist attractions in Boston:\n", "\n", "1. Freedom Trail - a 2.5-mile walking trail that takes you through some of the most important historical sites in Boston, including Paul Revere's House, the Old North Church, and the site of the Boston Massacre.\n", "2. Fenway Park - home to the Boston Red Sox baseball team, this historic ballpark is one of the oldest in Major League Baseball.\n", "3. Museum of Fine Arts - one of the largest art museums in the country, with a collection of over 450,000 works of art from around the world.\n", "4. Boston Harbor Islands National Recreation Area - a group of islands located just offshore from downtown Boston that offer stunning views of the city skyline and easy access to outdoor recreational activities like hiking and kayaking.\n", "5. New England Aquarium - one of the oldest and largest aquariums in the United States, featuring a wide variety of marine life, including giant whales and colorful fish.\n", "6. The USS Constitution Museum - located on board the USS Constitution, a historic ship that played a key role in the War of 1812 and is still in active service today.\n", "7. Bunker Hill Monument - a 221-foot-tall obelisk located in Charlestown that commemorates the Battle of Bunker Hill during the Revolutionary War.\n", "8. The Hancock Building - a historic building in the heart of Boston that offers panoramic views of the city from its observation deck.\n", "==================================================\n", "Speed of Inference\n", "--------------------------------------------------\n", "Generation Stage : 10.13 ms/token\n", "==================================================\n", "EXIT...\n" ] } ], "source": [ "model_prompter = get_prompter(model, model_path)\n", "stream_generator = StreamGenerator\n", "count = 0\n", "while True:\n", " # Get input from the user\n", " input_prompt = input(\"USER: \")\n", " if input_prompt == \"\":\n", " print(\"EXIT...\")\n", " break\n", " model_prompter.insert_prompt(input_prompt)\n", " output_stream = stream_generator(model, tokenizer, model_prompter.model_input, gen_params, device=\"cuda:0\")\n", " outputs = stream_output(output_stream) \n", " model_prompter.update_template(outputs)\n", " count += 1" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }