{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AWQ on Vicuna"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we use the Vicuna model to demonstrate the performance of AWQ on instruction-tuned models. We implement real-INT4 AWQ inference kernels, which are wrapped as PyTorch modules and can be dropped into existing models. We also provide a simple example showing how to use AWQ to quantize a model and save/load the quantized model checkpoint."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this notebook, you need to install the following packages:\n",
    "- [AWQ](https://github.com/mit-han-lab/llm-awq)\n",
    "- [PyTorch](https://pytorch.org/)\n",
    "- [Accelerate](https://github.com/huggingface/accelerate)\n",
    "- [Transformers](https://github.com/huggingface/transformers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from awq.models.auto import AutoAWQForCausalLM\n",
    "from transformers import AutoTokenizer\n",
    "from tinychat.demo import gen_params, stream_output\n",
    "from tinychat.stream_generators import StreamGenerator\n",
    "from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp\n",
    "from tinychat.utils.prompt_templates import get_prompter\n",
    "import os\n",
    "# This demo only supports a single GPU for now\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Replacing layers...: 100%|██████████| 32/32 [00:02<00:00, 11.85it/s]\n"
     ]
    }
   ],
   "source": [
    "model_path = 'vicuna-7b-v1.5-awq'\n",
    "quant_file = 'awq_model_w4_g128.pt'\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)\n",
    "model = AutoAWQForCausalLM.from_quantized(model_path, quant_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LlamaForCausalLM(\n",
       "  (model): LlamaModel(\n",
       "    (embed_tokens): Embedding(32000, 4096, padding_idx=0)\n",
       "    (layers): ModuleList(\n",
       "      (0-31): 32 x LlamaDecoderLayer(\n",
       "        (self_attn): QuantLlamaAttention(\n",
       "          (qkv_proj): WQLinear(in_features=4096, out_features=12288, bias=False, w_bit=4, group_size=128)\n",
       "          (o_proj): WQLinear(in_features=4096, out_features=4096, bias=False, w_bit=4, group_size=128)\n",
       "          (rotary_emb): QuantLlamaRotaryEmbedding()\n",
       "        )\n",
       "        (mlp): QuantLlamaMLP(\n",
       "          (down_proj): WQLinear(in_features=11008, out_features=4096, bias=False, w_bit=4, group_size=128)\n",
       "        )\n",
       "        (input_layernorm): FTLlamaRMSNorm()\n",
       "        (post_attention_layernorm): FTLlamaRMSNorm()\n",
       "      )\n",
       "    )\n",
       "    (norm): FTLlamaRMSNorm()\n",
       "  )\n",
       "  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)\n",
       ")"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "make_quant_attn(model.model, \"cuda:0\")\n",
    "make_quant_norm(model.model)\n",
    "make_fused_mlp(model.model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ASSISTANT: Sure! Here are some popular tourist attractions in Boston:\n",
      "\n",
      "1. Freedom Trail - a 2.5-mile walking trail that takes you through some of the most important historical sites in Boston, including Paul Revere's House, the Old North Church, and the site of the Boston Massacre.\n",
      "2. Fenway Park - home to the Boston Red Sox baseball team, this historic ballpark is one of the oldest in Major League Baseball.\n",
      "3. Museum of Fine Arts - one of the largest art museums in the country, with a collection of over 450,000 works of art from around the world.\n",
      "4. Boston Harbor Islands National Recreation Area - a group of islands located just offshore from downtown Boston that offer stunning views of the city skyline and easy access to outdoor recreational activities like hiking and kayaking.\n",
      "5. New England Aquarium - one of the oldest and largest aquariums in the United States, featuring a wide variety of marine life, including giant whales and colorful fish.\n",
      "6. The USS Constitution Museum - located on board the USS Constitution, a historic ship that played a key role in the War of 1812 and is still in active service today.\n",
      "7. Bunker Hill Monument - a 221-foot-tall obelisk located in Charlestown that commemorates the Battle of Bunker Hill during the Revolutionary War.\n",
      "8. The Hancock Building - a historic building in the heart of Boston that offers panoramic views of the city from its observation deck.\n",
      "==================================================\n",
      "Speed of Inference\n",
      "--------------------------------------------------\n",
      "Generation Stage : 10.13 ms/token\n",
      "==================================================\n",
      "EXIT...\n"
     ]
    }
   ],
   "source": [
    "model_prompter = get_prompter(model, model_path)\n",
    "stream_generator = StreamGenerator\n",
    "count = 0\n",
    "while True:\n",
    "    # Get input from the user\n",
    "    input_prompt = input(\"USER: \")\n",
    "    if input_prompt == \"\":\n",
    "        print(\"EXIT...\")\n",
    "        break\n",
    "    model_prompter.insert_prompt(input_prompt)\n",
    "    output_stream = stream_generator(model, tokenizer, model_prompter.model_input, gen_params, device=\"cuda:0\")\n",
    "    outputs = stream_output(output_stream)\n",
    "    model_prompter.update_template(outputs)\n",
    "    count += 1"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}