Commit 6924d106 authored by Casper

Remove examples

parent b85ee5c1
# AWQ Examples
Here we provide two examples of applying AWQ to:
- [Vicuna-7B](https://github.com/lm-sys/FastChat), an instruction-tuned chatbot
- [LLaVA-13B](https://github.com/haotian-liu/LLaVA), a visual language model for multi-modal applications such as visual reasoning

Below are some example outputs from the two demos. You should be able to observe the memory savings when running the demos in 4-bit. Please check the notebooks for details.
![overview](../figures/example_vis.jpg)
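
If you want to verify the memory saving yourself, below is a minimal sketch (not part of the original demos) that sums the storage of the quantized weights after loading; it assumes the `vicuna-7b-v1.5-awq` checkpoint used in the Vicuna notebook.

```python
# Rough sketch: report the weight footprint of the 4-bit AWQ model
# (checkpoint paths are the ones used in the Vicuna notebook).
import torch
from awq.models.auto import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized("vicuna-7b-v1.5-awq", "awq_model_w4_g128.pt")

# Packed INT4 weights live in int32 buffers, so summing parameter and buffer
# storage reflects the quantized footprint rather than the original FP16 one.
n_bytes = sum(t.numel() * t.element_size()
              for t in list(model.model.parameters()) + list(model.model.buffers()))
print(f"Quantized weight footprint: {n_bytes / 1e9:.2f} GB")
```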
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AWQ on Vicuna"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we use Vicuna model to demonstrate the performance of AWQ on instruction-tuned models. We implement AWQ real-INT4 inference kernels, which are wrapped as Pytorch modules and can be easily used by existing models. We also provide a simple example to show how to use AWQ to quantize a model and save/load the quantized model checkpoint."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to run this notebook, you need to install the following packages:\n",
"- [AWQ](https://github.com/mit-han-lab/llm-awq)\n",
"- [Pytorch](https://pytorch.org/)\n",
"- [Accelerate](https://github.com/huggingface/accelerate)\n",
"- [Transformers](https://github.com/huggingface/transformers)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from awq.models.auto import AutoAWQForCausalLM\n",
"from transformers import AutoTokenizer\n",
"from tinychat.demo import gen_params, stream_output\n",
"from tinychat.stream_generators import StreamGenerator\n",
"from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp\n",
"from tinychat.utils.prompt_templates import get_prompter\n",
"import os\n",
"# This demo only support single GPU for now\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\""
]
},
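{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell loads a pre-quantized checkpoint (`vicuna-7b-v1.5-awq/awq_model_w4_g128.pt`). As a rough sketch (mirroring the LLaVA notebook; the `run_search=True` flag is an assumption, not verified here), such a checkpoint could be produced with the same API as in the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch (not executed in this demo): produce a 4-bit AWQ checkpoint for Vicuna.\n",
"# run_search=True is assumed; the LLaVA notebook uses pre-computed search results instead.\n",
"from awq.models.auto import AutoAWQForCausalLM\n",
"\n",
"base_model = AutoAWQForCausalLM.from_pretrained('lmsys/vicuna-7b-v1.5')\n",
"quant_config = {\"zero_point\": True, \"q_group_size\": 128, \"w_bit\": 4}\n",
"base_model.quantize(quant_config=quant_config, run_search=True, run_quant=True)\n",
"base_model.save_quantized('vicuna-7b-v1.5-awq')"
]
},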
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Replacing layers...: 100%|██████████| 32/32 [00:02<00:00, 11.85it/s]\n"
]
}
],
"source": [
"model_path = 'vicuna-7b-v1.5-awq'\n",
"quant_file = 'awq_model_w4_g128.pt'\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)\n",
"model = AutoAWQForCausalLM.from_quantized(model_path, quant_file)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LlamaForCausalLM(\n",
" (model): LlamaModel(\n",
" (embed_tokens): Embedding(32000, 4096, padding_idx=0)\n",
" (layers): ModuleList(\n",
" (0-31): 32 x LlamaDecoderLayer(\n",
" (self_attn): QuantLlamaAttention(\n",
" (qkv_proj): WQLinear(in_features=4096, out_features=12288, bias=False, w_bit=4, group_size=128)\n",
" (o_proj): WQLinear(in_features=4096, out_features=4096, bias=False, w_bit=4, group_size=128)\n",
" (rotary_emb): QuantLlamaRotaryEmbedding()\n",
" )\n",
" (mlp): QuantLlamaMLP(\n",
" (down_proj): WQLinear(in_features=11008, out_features=4096, bias=False, w_bit=4, group_size=128)\n",
" )\n",
" (input_layernorm): FTLlamaRMSNorm()\n",
" (post_attention_layernorm): FTLlamaRMSNorm()\n",
" )\n",
" )\n",
" (norm): FTLlamaRMSNorm()\n",
" )\n",
" (lm_head): Linear(in_features=4096, out_features=32000, bias=False)\n",
")"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"make_quant_attn(model.model, \"cuda:0\")\n",
"make_quant_norm(model.model)\n",
"make_fused_mlp(model.model)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ASSISTANT: Sure! Here are some popular tourist attractions in Boston:\n",
"\n",
"1. Freedom Trail - a 2.5-mile walking trail that takes you through some of the most important historical sites in Boston, including Paul Revere's House, the Old North Church, and the site of the Boston Massacre.\n",
"2. Fenway Park - home to the Boston Red Sox baseball team, this historic ballpark is one of the oldest in Major League Baseball.\n",
"3. Museum of Fine Arts - one of the largest art museums in the country, with a collection of over 450,000 works of art from around the world.\n",
"4. Boston Harbor Islands National Recreation Area - a group of islands located just offshore from downtown Boston that offer stunning views of the city skyline and easy access to outdoor recreational activities like hiking and kayaking.\n",
"5. New England Aquarium - one of the oldest and largest aquariums in the United States, featuring a wide variety of marine life, including giant whales and colorful fish.\n",
"6. The USS Constitution Museum - located on board the USS Constitution, a historic ship that played a key role in the War of 1812 and is still in active service today.\n",
"7. Bunker Hill Monument - a 221-foot-tall obelisk located in Charlestown that commemorates the Battle of Bunker Hill during the Revolutionary War.\n",
"8. The Hancock Building - a historic building in the heart of Boston that offers panoramic views of the city from its observation deck.\n",
"==================================================\n",
"Speed of Inference\n",
"--------------------------------------------------\n",
"Generation Stage : 10.13 ms/token\n",
"==================================================\n",
"EXIT...\n"
]
}
],
"source": [
"model_prompter = get_prompter(model, model_path)\n",
"stream_generator = StreamGenerator\n",
"count = 0\n",
"while True:\n",
" # Get input from the user\n",
" input_prompt = input(\"USER: \")\n",
" if input_prompt == \"\":\n",
" print(\"EXIT...\")\n",
" break\n",
" model_prompter.insert_prompt(input_prompt)\n",
" output_stream = stream_generator(model, tokenizer, model_prompter.model_input, gen_params, device=\"cuda:0\")\n",
" outputs = stream_output(output_stream) \n",
" model_prompter.update_template(outputs)\n",
" count += 1"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# AWQ on LLaVA"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we use LLaVA model to demonstrate the performance of AWQ on multi-modal models. We implement AWQ real-INT4 inference kernels, which are wrapped as Pytorch modules and can be easily used by existing models. We also provide a simple example to show how to use AWQ to quantize a model and save/load the quantized model checkpoint."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to run this notebook, you need to install the following packages:\n",
"- [AWQ](https://github.com/mit-han-lab/llm-awq)\n",
"- [Pytorch](https://pytorch.org/)\n",
"- [Accelerate](https://github.com/huggingface/accelerate)\n",
"- [LLaVA](https://github.com/haotian-liu/LLaVA)\n",
"- [Transformers](https://github.com/huggingface/transformers)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jilin/anaconda3/envs/llava/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"import torch\n",
"import requests\n",
"from PIL import Image\n",
"from io import BytesIO\n",
"from accelerate import init_empty_weights, load_checkpoint_and_dispatch\n",
"from transformers import AutoTokenizer, CLIPVisionModel, CLIPImageProcessor, logging\n",
"logging.set_verbosity_error() # too many warnings\n",
"from llava.conversation import conv_templates, SeparatorStyle\n",
"from llava.utils import disable_torch_init\n",
"from llava.model import *\n",
"from llava.model.utils import KeywordsStoppingCriteria\n",
"from awq.models.auto import AutoAWQForCausalLM\n",
"import os\n",
"import gc\n",
"\n",
"from awq.quantize.auto_clip import apply_clip\n",
"from awq.quantize.auto_scale import apply_scale\n",
"\n",
"# This demo only support single GPU for now\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
"DEFAULT_IMAGE_TOKEN = \"<image>\"\n",
"DEFAULT_IMAGE_PATCH_TOKEN = \"<im_patch>\"\n",
"DEFAULT_IM_START_TOKEN = \"<im_start>\"\n",
"DEFAULT_IM_END_TOKEN = \"<im_end>\"\n",
"\n",
"def load_search_result_into_memory(model, search_path):\n",
" awq_results = torch.load(search_path, map_location=\"cpu\")\n",
" \n",
" apply_scale(model, awq_results[\"scale\"])\n",
" apply_clip(model, awq_results[\"clip\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Please get the LLaVA model from [LLaVA](https://github.com/haotian-liu/LLaVA) and run the following cell to generate a quantized model checkpoint first (note that we only quantize the language decoder, which dominates the model parameters). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:09<00:00, 3.14s/it]\n",
"real weight quantization...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [09:07<00:00, 13.69s/it]\n"
]
}
],
"source": [
"model_path = \"liuhaotian/LLaVA-13b-delta-v0\"\n",
"quant_path = \"LLaVA-13B-v0-awq\" # place to dump quant weights\n",
"search_path = \"../awq_cache/llava-13b-v0-w4-g128.pt\" # place where you stored search results\n",
"\n",
"# Load model\n",
"model = AutoAWQForCausalLM.from_pretrained(model_path)\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n",
"quant_config = {\"zero_point\": True, \"q_group_size\": 128, \"w_bit\": 4}\n",
"\n",
"# Load model and search results\n",
"load_search_result_into_memory(model.model, search_path)\n",
"\n",
"# Run actual weight quantization\n",
"model.quantize(quant_config=quant_config, run_search=False, run_quant=True)\n",
"\n",
"# Save quantized model\n",
"model.save_quantized(quant_path)\n",
"\n",
"del model"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then input a image link and a question below.\n",
"\n",
"![](https://llava.hliu.cc/file=/nobackup/haotian/code/LLaVA/llava/serve/examples/extreme_ironing.jpg)\n",
"\n",
"## Q: What is unusual about this image?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"query = \"What is unusual about this image?\"\n",
"image_file = \"https://llava.hliu.cc/file=/nobackup/haotian/code/LLaVA/llava/serve/examples/extreme_ironing.jpg\" "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We first load a empty model and replace all the linear layers with WQLinear layers. Then we load the quantized weights from the checkpoint. "
]
},
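{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a self-contained illustration of this empty-init-then-swap pattern (toy modules only: the `QuantLinear` stand-in below is not llm-awq's `WQLinear`, the layer sizes are illustrative, and the real work is done by `from_quantized` in the following cell):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: build a skeleton with no real weight storage,\n",
"# then swap nn.Linear for a quantized stand-in before loading a checkpoint.\n",
"import torch\n",
"import torch.nn as nn\n",
"from accelerate import init_empty_weights\n",
"\n",
"class QuantLinear(nn.Module):\n",
"    # Toy stand-in for WQLinear: packed INT4 weights plus per-group FP16 scales.\n",
"    def __init__(self, in_features, out_features, w_bit=4, group_size=128):\n",
"        super().__init__()\n",
"        self.register_buffer('qweight', torch.zeros(in_features, out_features * w_bit // 32, dtype=torch.int32))\n",
"        self.register_buffer('scales', torch.zeros(in_features // group_size, out_features, dtype=torch.float16))\n",
"\n",
"def swap_linears(module):\n",
"    for name, child in module.named_children():\n",
"        if isinstance(child, nn.Linear):\n",
"            setattr(module, name, QuantLinear(child.in_features, child.out_features))\n",
"        else:\n",
"            swap_linears(child)\n",
"\n",
"with init_empty_weights():  # parameters live on the meta device, so no memory is allocated\n",
"    skeleton = nn.Sequential(nn.Linear(5120, 13824), nn.Linear(13824, 5120))\n",
"swap_linears(skeleton)      # quantized modules now hold (empty) packed buffers\n",
"print(skeleton)             # from_quantized fills such buffers from the real checkpoint"
]
},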
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00, 5.17s/it]\n",
"real weight quantization...(init only): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:37<00:00, 1.08it/s]\n"
]
}
],
"source": [
"disable_torch_init()\n",
"\n",
"# Load model\n",
"model = AutoAWQForCausalLM.from_quantized(quant_path, quant_filename=\"awq_model_w4_g128.pt\")\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jilin/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py:1211: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The unusual aspect of this image is that a man is standing on a portable ironing board in the middle of the road, ironing clothes while traffic, including a yellow taxi, moves around him. This is not a typical scene you would expect to see in a city, as ironing is usually done in a private setting like a home, and not on the street amidst traffic. It brings attention to the unconventional and unexpected nature of the situation.\n"
]
}
],
"source": [
"def load_image(image_file):\n",
" if image_file.startswith('http') or image_file.startswith('https'):\n",
" response = requests.get(image_file)\n",
" image = Image.open(BytesIO(response.content)).convert('RGB')\n",
" else:\n",
" image = Image.open(image_file).convert('RGB')\n",
" return image\n",
"\n",
"image_processor = CLIPImageProcessor.from_pretrained(model.model.config.mm_vision_tower, torch_dtype=torch.float16)\n",
"\n",
"mm_use_im_start_end = getattr(model.config, \"mm_use_im_start_end\", False)\n",
"tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)\n",
"if mm_use_im_start_end:\n",
" tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)\n",
"\n",
"vision_tower = model.model.get_model().vision_tower[0]\n",
"if vision_tower.device.type == 'meta':\n",
" vision_tower = CLIPVisionModel.from_pretrained(vision_tower.config._name_or_path, torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()\n",
" model.model.get_model().vision_tower[0] = vision_tower\n",
"else:\n",
" vision_tower.to(device='cuda', dtype=torch.float16)\n",
"vision_config = vision_tower.config\n",
"vision_config.im_patch_token = tokenizer.convert_tokens_to_ids([DEFAULT_IMAGE_PATCH_TOKEN])[0]\n",
"vision_config.use_im_start_end = mm_use_im_start_end\n",
"if mm_use_im_start_end:\n",
" vision_config.im_start_token, vision_config.im_end_token = tokenizer.convert_tokens_to_ids([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN])\n",
"image_token_len = (vision_config.image_size // vision_config.patch_size) ** 2\n",
"\n",
"qs = query\n",
"if mm_use_im_start_end:\n",
" qs = qs + '\\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN\n",
"else:\n",
" qs = qs + '\\n' + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len\n",
"\n",
"conv_mode = \"multimodal\"\n",
"\n",
"conv = conv_templates[conv_mode].copy()\n",
"conv.append_message(conv.roles[0], qs)\n",
"conv.append_message(conv.roles[1], None)\n",
"prompt = conv.get_prompt()\n",
"inputs = tokenizer([prompt])\n",
"\n",
"image = load_image(image_file)\n",
"image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]\n",
"\n",
"input_ids = torch.as_tensor(inputs.input_ids).cuda()\n",
"\n",
"stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2\n",
"keywords = [stop_str]\n",
"stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)\n",
"\n",
"with torch.inference_mode():\n",
" output_ids = model.generate(\n",
" input_ids,\n",
" images=image_tensor.unsqueeze(0).half().cuda(),\n",
" do_sample=True,\n",
" temperature=0.2,\n",
" max_new_tokens=1024,\n",
" stopping_criteria=[stopping_criteria])\n",
"\n",
"input_token_len = input_ids.shape[1]\n",
"n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()\n",
"if n_diff_input_output > 0:\n",
" print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')\n",
"outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]\n",
"outputs = outputs.strip()\n",
"if outputs.endswith(stop_str):\n",
" outputs = outputs[:-len(stop_str)]\n",
"outputs = outputs.strip()\n",
"print(outputs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
MODEL=facebook/opt-6.7b
# run AWQ search (optional; we provide the pre-computed results)
python -m awq.entry --entry_type search \
--model_path $MODEL \
--search_path $MODEL-awq
# generate real quantized weights (w4)
python -m awq.entry --entry_type quant \
--model_path $MODEL \
--search_path $MODEL-awq/awq_model_search_result.pt \
--quant_path $MODEL-awq
# load and evaluate the real quantized model (smaller GPU memory usage)
python -m awq.entry --entry_type perplexity \
--quant_path $MODEL-awq \
--quant_file awq_model_w4_g128.pt
MODEL=lmsys/vicuna-7b-v1.5
# run AWQ search (optional; we provide the pre-computed results)
python -m awq.entry --entry_type search \
--model_path $MODEL \
--search_path $MODEL-awq
# generate real quantized weights (w4)
python -m awq.entry --entry_type quant \
--model_path $MODEL \
--search_path $MODEL-awq/awq_model_search_result.pt \
--quant_path $MODEL-awq
# load and evaluate the real quantized model (smaller GPU memory usage)
python -m awq.entry --entry_type perplexity \
--quant_path $MODEL-awq \
--quant_file awq_model_w4_g128.pt