Commit 2a243105 authored by Casper Hansen

Modify README and examples

parent fd5b8c88
......@@ -71,13 +71,16 @@ git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
The detailed support list:
| Models | Sizes | INT4-g128 | INT3-g128 |
| ------ | --------------------------- | --------- | --------- |
| LLaMA-2 | 7B/7B-chat/13B/13B-chat | ✅ | ✅ |
| LLaMA | 7B/13B/30B/65B | ✅ | ✅ |
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B | ✅ | ✅ |
| Vicuna-v1.1 | 7B/13B | ✅ | |
| LLaVA-v0 | 13B | ✅ | |
| Models | Sizes | INT4-g128 | INT3-g128 |
| ---------| ----------------------------| -----------| --------- |
| LLaMA-2 | 7B/13B/70B | ✅ | ✅ |
| LLaMA | 7B/13B/30B/65B | ✅ | ✅ |
| Vicuna | 7B/13B | ✅ | |
| MPT | 7B/30B | ✅ | |
| Falcon | 7B/40B | ✅ | |
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B | ✅ | ✅ |
| Bloom | 560m/3B/7B | ✅ | ✅ |
| LLaVA-v0 | 13B | ✅ | |
## Examples
......@@ -91,39 +94,30 @@ Note that we perform AWQ using only textual calibration data, despite we are run
## Usage
We provide several sample scripts to run AWQ (please refer to `./scripts`). We use OPT-6.7B as an example.
We provide several sample scripts to run AWQ (please refer to `./scripts`). We use Vicuna 7B v1.5 as an example.
1. Perform AWQ search and save search results (we already did it for you):
1. Perform AWQ search and save search results
```bash
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq awq_cache/opt-6.7b-w4-g128.pt
python -m awq.entry --entry_type search \
--model_path lmsys/vicuna-7b-v1.5 \
--search_path vicuna-7b-v1.5-awq
```
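Optionally, you can sanity-check what the search step saved before moving on. Below is a minimal sketch, assuming the result file is a plain `torch.save`d dict with per-layer `scale` and `clip` entries (that layout mirrors the helper used in the LLaVA notebook in this commit; the file name matches the one passed to `--search_path` in the quantization step):
```python
import torch

# Load the AWQ search results on CPU and list what was recorded.
# The "scale"/"clip" keys are an assumption based on the LLaVA notebook helper.
awq_results = torch.load("vicuna-7b-v1.5-awq/awq_model_search_result.pt", map_location="cpu")
print(list(awq_results.keys()))  # expected: scaling factors ("scale") and clipping ranges ("clip")
```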
2. Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization)
```bash
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/opt-6.7b-w4-g128.pt \
--q_backend fake
```
Note: if you use Falcon 7B, please pass `--q_group_size 64` in order for it to work.
3. Generate real quantized weights (INT4)
2. Generate quantized weights and save them (INT4)
```bash
mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/opt-6.7b-w4-g128.pt \
--q_backend real --dump_quant quant_cache/opt-6.7b-w4-g128-awq.pt
python -m awq.entry --entry_type quant \
--model_path lmsys/vicuna-7b-v1.5 \
--search_path vicuna-7b-v1.5-awq/awq_model_search_result.pt \
--quant_path vicuna-7b-v1.5-awq
```
4. Load and evaluate the real quantized model (you should now see lower GPU memory usage)
3. Load the real quantized model weights and evaluate perplexity (faster and uses less memory)
```bash
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_quant quant_cache/opt-6.7b-w4-g128-awq.pt
python -m awq.entry --entry_type perplexity \
--quant_path vicuna-7b-v1.5-awq \
--quant_file awq_model_w4_g128.pt
```
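Beyond the perplexity entry point, the quantized checkpoint can also be loaded directly from Python. The sketch below mirrors the vicuna_demo notebook later in this commit; the `AutoAWQForCausalLM.from_quantized` call and the `awq_model_w4_g128.pt` file name are taken from there, and the paths are assumed to match the quantization step above.
```python
from awq.models.auto import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Paths produced by the quantization step above.
quant_path = "vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load the tokenizer and the real-quantized (INT4) model, as done in vicuna_demo.ipynb.
tokenizer = AutoTokenizer.from_pretrained(quant_path, use_fast=False)
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
```
From here you can run your own evaluation or feed the loaded model into the TinyChat demo described below.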
## Reference
......
......@@ -31,10 +31,8 @@
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from accelerate import init_empty_weights, load_checkpoint_and_dispatch\n",
"from awq.quantize.quantizer import real_quantize_model_weight\n",
"from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig\n",
"from awq.models.auto import AutoAWQForCausalLM\n",
"from transformers import AutoTokenizer\n",
"from tinychat.demo import gen_params, stream_output\n",
"from tinychat.stream_generators import StreamGenerator\n",
"from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp\n",
......@@ -44,98 +42,32 @@
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please get the Vicuna model from [FastChat](https://github.com/lm-sys/FastChat) and run the following command to generate a quantized model checkpoint first.\n",
"\n",
"```bash\n",
"mkdir quant_cache\n",
"python -m awq.entry --model_path [vicuna-7b_model_path] \\\n",
" --w_bit 4 --q_group_size 128 \\\n",
" --load_awq awq_cache/vicuna-7b-w4-g128.pt \\\n",
" --q_backend real --dump_quant quant_cache/vicuna-7b-w4-g128-awq.pt\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# model_path = \"\" # the path of vicuna-7b model\n",
"# load_quant_path = \"quant_cache/vicuna-7b-w4-g128-awq.pt\"\n",
"model_path = \"/data/llm/checkpoints/vicuna-hf/vicuna-7b\"\n",
"load_quant_path = \"/data/llm/checkpoints/vicuna-hf/vicuna-7b-awq-w4g128.pt\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first load a empty model and replace all the linear layers with WQLinear layers. Then we load the quantized weights from the checkpoint. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8b79a82b73ab4d9191ba54f5d0f8cb86",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"real weight quantization...(init only): 100%|███████████████████| 32/32 [00:11<00:00, 2.69it/s]\n",
"The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.\n",
"The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.\n"
"Replacing layers...: 100%|██████████| 32/32 [00:02<00:00, 11.85it/s]\n"
]
}
],
"source": [
"config = AutoConfig.from_pretrained(model_path)\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)\n",
"with init_empty_weights():\n",
" model = AutoModelForCausalLM.from_pretrained(model_path, config=config,\n",
" torch_dtype=torch.float16)\n",
"q_config = {\"zero_point\": True, \"q_group_size\": 128}\n",
"real_quantize_model_weight(\n",
" model, w_bit=4, q_config=q_config, init_only=True)\n",
"model_path = 'vicuna-7b-v1.5-awq'\n",
"quant_file = 'awq_model_w4_g128.pt'\n",
"\n",
"model = load_checkpoint_and_dispatch(\n",
" model, load_quant_path,\n",
" device_map=\"auto\",\n",
" no_split_module_classes=[\"LlamaDecoderLayer\"]\n",
")"
"tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)\n",
"model = AutoAWQForCausalLM.from_quantized(model_path, quant_file)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Warning] Calling a fake MLP fusion. But still faster than Huggingface Implimentation.\n"
]
},
{
"data": {
"text/plain": [
......@@ -162,69 +94,47 @@
")"
]
},
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"make_quant_attn(model, \"cuda:0\")\n",
"make_quant_norm(model)\n",
"make_fused_mlp(model)"
"make_quant_attn(model.model, \"cuda:0\")\n",
"make_quant_norm(model.model)\n",
"make_fused_mlp(model.model)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"USER: Show me some attractions in Boston.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ASSISTANT: 1. Boston Public Library\n",
"2. Fenway Park\n",
"3. Harvard Square\n",
"4. Boston Common\n",
"5. Freedom Trail\n",
"6. Museum of Fine Arts\n",
"7. Isabella Stewart Gardner Museum\n",
"8. Paul Revere House\n",
"9. New England Aquarium\n",
"10. Museum of Science\n",
"ASSISTANT: Sure! Here are some popular tourist attractions in Boston:\n",
"\n",
"1. Freedom Trail - a 2.5-mile walking trail that takes you through some of the most important historical sites in Boston, including Paul Revere's House, the Old North Church, and the site of the Boston Massacre.\n",
"2. Fenway Park - home to the Boston Red Sox baseball team, this historic ballpark is one of the oldest in Major League Baseball.\n",
"3. Museum of Fine Arts - one of the largest art museums in the country, with a collection of over 450,000 works of art from around the world.\n",
"4. Boston Harbor Islands National Recreation Area - a group of islands located just offshore from downtown Boston that offer stunning views of the city skyline and easy access to outdoor recreational activities like hiking and kayaking.\n",
"5. New England Aquarium - one of the oldest and largest aquariums in the United States, featuring a wide variety of marine life, including giant whales and colorful fish.\n",
"6. The USS Constitution Museum - located on board the USS Constitution, a historic ship that played a key role in the War of 1812 and is still in active service today.\n",
"7. Bunker Hill Monument - a 221-foot-tall obelisk located in Charlestown that commemorates the Battle of Bunker Hill during the Revolutionary War.\n",
"8. The Hancock Building - a historic building in the heart of Boston that offers panoramic views of the city from its observation deck.\n",
"==================================================\n",
"Speed of Inference\n",
"--------------------------------------------------\n",
"Context Stage : 7.18 ms/token\n",
"Generation Stage : 9.49 ms/token\n",
"Average Speed : 8.53 ms/token\n",
"==================================================\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"USER: \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Generation Stage : 10.13 ms/token\n",
"==================================================\n",
"EXIT...\n"
]
}
],
"source": [
"model_prompter = get_prompter(\"llama\", model_path)\n",
"model_prompter = get_prompter(model, model_path)\n",
"stream_generator = StreamGenerator\n",
"count = 0\n",
"while True:\n",
......@@ -239,20 +149,13 @@
" model_prompter.update_template(outputs)\n",
" count += 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (awq)",
"display_name": "Python 3",
"language": "python",
"name": "awq"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
......@@ -264,7 +167,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
"version": "3.10.6"
}
},
"nbformat": 4,
......
......@@ -55,17 +55,25 @@
"from llava.utils import disable_torch_init\n",
"from llava.model import *\n",
"from llava.model.utils import KeywordsStoppingCriteria\n",
"from awq.quantize.pre_quant import apply_awq\n",
"from awq.quantize.quantizer import real_quantize_model_weight\n",
"from awq.models.auto import AutoAWQForCausalLM\n",
"import os\n",
"import gc\n",
"\n",
"from awq.quantize.auto_clip import apply_clip\n",
"from awq.quantize.auto_scale import apply_scale\n",
"\n",
"# This demo only support single GPU for now\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
"DEFAULT_IMAGE_TOKEN = \"<image>\"\n",
"DEFAULT_IMAGE_PATCH_TOKEN = \"<im_patch>\"\n",
"DEFAULT_IM_START_TOKEN = \"<im_start>\"\n",
"DEFAULT_IM_END_TOKEN = \"<im_end>\""
"DEFAULT_IM_END_TOKEN = \"<im_end>\"\n",
"\n",
"def load_search_result_into_memory(model, search_path):\n",
" awq_results = torch.load(search_path, map_location=\"cpu\")\n",
" \n",
" apply_scale(model, awq_results[\"scale\"])\n",
" apply_clip(model, awq_results[\"clip\"])"
]
},
{
......@@ -91,20 +99,25 @@
}
],
"source": [
"model_path = \"/dataset/llava/LLaVA-13B-v0\" # Please change here \n",
"quant_path = \"../quant_cache/LLaVA-13B-v0-w4-g128-awq.pt\" # place to dump quant weights\n",
"model_path = \"liuhaotian/LLaVA-13b-delta-v0\"\n",
"quant_path = \"LLaVA-13B-v0-awq\" # place to dump quant weights\n",
"search_path = \"../awq_cache/llava-13b-v0-w4-g128.pt\" # place where you stored search results\n",
"\n",
"model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16, use_cache=True).cuda()\n",
"# Load model\n",
"model = AutoAWQForCausalLM.from_pretrained(model_path)\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n",
"quant_config = {\"zero_point\": True, \"q_group_size\": 128, \"w_bit\": 4}\n",
"\n",
"awq_results = torch.load(\"../awq_cache/llava-13b-v0-w4-g128.pt\", map_location=\"cpu\")\n",
"apply_awq(model, awq_results)\n",
"# Load model and search results\n",
"load_search_result_into_memory(model.model, search_path)\n",
"\n",
"real_quantize_model_weight(model, w_bit=4, q_config={\"zero_point\": True, \"q_group_size\": 128})\n",
"torch.save(model.cpu().state_dict(), quant_path)\n",
"# Run actual weight quantization\n",
"model.quantize(quant_config=quant_config, run_search=False, run_quant=True)\n",
"\n",
"del model\n",
"gc.collect()\n",
"torch.cuda.empty_cache()"
"# Save quantized model\n",
"model.save_quantized(quant_path)\n",
"\n",
"del model"
]
},
{
......@@ -152,20 +165,11 @@
}
],
"source": [
"\n",
"disable_torch_init()\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path)\n",
"config = LlavaConfig.from_pretrained(model_path)\n",
"with init_empty_weights():\n",
" model = LlavaLlamaForCausalLM.from_pretrained(model_path, config=config,\n",
" torch_dtype=torch.float16, device_map=\"auto\")\n",
"q_config = {\"zero_point\": True, \"q_group_size\": 128}\n",
"real_quantize_model_weight(\n",
" model, w_bit=4, q_config=q_config, init_only=True)\n",
"\n",
"model = load_checkpoint_and_dispatch(\n",
" model, quant_path, device_map=\"auto\"\n",
")"
"# Load model\n",
"model = AutoAWQForCausalLM.from_quantized(quant_path, quant_filename=\"awq_model_w4_g128.pt\")\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)"
]
},
{
......@@ -198,17 +202,17 @@
" image = Image.open(image_file).convert('RGB')\n",
" return image\n",
"\n",
"image_processor = CLIPImageProcessor.from_pretrained(model.config.mm_vision_tower, torch_dtype=torch.float16)\n",
"image_processor = CLIPImageProcessor.from_pretrained(model.model.config.mm_vision_tower, torch_dtype=torch.float16)\n",
"\n",
"mm_use_im_start_end = getattr(model.config, \"mm_use_im_start_end\", False)\n",
"tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)\n",
"if mm_use_im_start_end:\n",
" tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)\n",
"\n",
"vision_tower = model.get_model().vision_tower[0]\n",
"vision_tower = model.model.get_model().vision_tower[0]\n",
"if vision_tower.device.type == 'meta':\n",
" vision_tower = CLIPVisionModel.from_pretrained(vision_tower.config._name_or_path, torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()\n",
" model.get_model().vision_tower[0] = vision_tower\n",
" model.model.get_model().vision_tower[0] = vision_tower\n",
"else:\n",
" vision_tower.to(device='cuda', dtype=torch.float16)\n",
"vision_config = vision_tower.config\n",
......
MODEL=llama-7b
# run AWQ search (optional; we provided the pre-computed results)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt
# evaluate the AWQ quantized model (simulated pseudo quantization)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/$MODEL-w4-g128.pt \
--q_backend fake
# generate real quantized weights (w4)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/$MODEL-w4-g128.pt \
--q_backend real --dump_quant quant_cache/$MODEL-w4-g128-awq.pt
# load and evaluate the real quantized model (lower GPU memory usage)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_quant quant_cache/$MODEL-w4-g128-awq.pt
\ No newline at end of file
MODEL=opt-6.7b
MODEL=facebook/opt-6.7b
# run AWQ search (optional; we provided the pre-computed results)
python -m awq.entry --model_path /dataset/opt/$MODEL \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt
# evaluate the AWQ quantized model (simulated pseudo quantization)
python -m awq.entry --model_path /dataset/opt/$MODEL \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/$MODEL-w4-g128.pt \
--q_backend fake
python -m awq.entry --entry_type search \
--model_path $MODEL \
--search_path $MODEL-awq
# generate real quantized weights (w4)
python -m awq.entry --model_path /dataset/opt/$MODEL \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/$MODEL-w4-g128.pt \
--q_backend real --dump_quant quant_cache/$MODEL-w4-g128-awq.pt
python -m awq.entry --entry_type quant \
--model_path $MODEL \
--search_path $MODEL-awq/awq_model_search_result.pt \
--quant_path $MODEL-awq
# load and evaluate the real quantized model (lower GPU memory usage)
python -m awq.entry --model_path /dataset/opt/$MODEL \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_quant quant_cache/$MODEL-w4-g128-awq.pt
\ No newline at end of file
python -m awq.entry --entry_type perplexity \
--quant_path $MODEL-awq \
--quant_file awq_model_w4_g128.pt
\ No newline at end of file
MODEL=lmsys/vicuna-7b-v1.5
# run AWQ search (optional; we provided the pre-computed results)
python -m awq.entry --entry_type search \
--model_path $MODEL \
--search_path $MODEL-awq
# generate real quantized weights (w4)
python -m awq.entry --entry_type quant \
--model_path $MODEL \
--search_path $MODEL-awq/awq_model_search_result.pt \
--quant_path $MODEL-awq
# load and evaluate the real quantized model (lower GPU memory usage)
python -m awq.entry --entry_type perplexity \
--quant_path $MODEL-awq \
--quant_file awq_model_w4_g128.pt
\ No newline at end of file
......@@ -7,23 +7,16 @@ We introduce TinyChat, a cutting-edge chatbot interface designed for lightweight
The current release supports:
- LLaMA-2-7B/13B-chat;
- Vicuna;
- MPT-chat;
- Falcon-instruct.
## Contents
- [Examples](#examples)
- [Benchmarks](#benchmarks)
- [Usage](#usage)
- [Reference](#reference)
......@@ -91,73 +84,27 @@ The latency reported in all tables is the per-token latency for the generation stage.
2. Download the pretrained instruction-tuned LLMs:
- For LLaMA-2-chat, please refer to [this link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf);
- For Vicuna, please refer to [this link](https://huggingface.co/lmsys/);
- For MPT-chat, please refer to [this link](https://huggingface.co/mosaicml/mpt-7b-chat);
- For Falcon-instruct, please refer to [this link](https://huggingface.co/tiiuae/falcon-7b-instruct).
3. Quantize instruction-tuned LLMs with AWQ:
- We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To download them, run:
```bash
# git lfs install # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
```
- You may run a one-line starter below:
```bash
./scripts/llama2_demo.sh
```
Alternatively, you may go through the process step by step. We will demonstrate the quantization process with LLaMA-2. For all models other than Falcon, you only need to change `model_path` and the saving locations; for Falcon-7B, you also need to change `q_group_size` from 128 to 64.
- Perform AWQ search and save search results (we already did it for you):
```bash
mkdir awq_cache
python -m awq.entry --model_path /PATH/TO/LLAMA2/llama-2-7b-chat \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq awq_cache/llama-2-7b-chat-w4-g128.pt
```
- Generate real quantized weights (INT4):
```bash
mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/LLAMA2/llama-2-7b-chat \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/llama-2-7b-chat-w4-g128.pt \
--q_backend real --dump_quant quant_cache/llama-2-7b-chat-w4-g128-awq.pt
```
3. Quantize instruction-tuned LLMs with AWQ (see [usage in README](../README.md#usage)).
4. Run the TinyChat demo:
Here, we use Vicuna as an example and assume that you have already quantized the model.
```bash
cd tinychat
python demo.py --model_type llama \
--model_path /PATH/TO/LLAMA2/llama-2-7b-chat \
--q_group_size 128 --load_quant quant_cache/llama-2-7b-chat-w4-g128-awq.pt \
    --precision W4A16
python demo.py --model_path vicuna-7b-v1.5-awq
```
Note: if you use Falcon-7B-instruct, please remember to also change `q_group_size` to 64. You may also run the following command to execute the chatbot in FP16 to compare the speed and quality of language generation:
You may also run the following command to execute the chatbot in FP16 to compare the speed and quality of language generation:
```bash
python demo.py --model_type llama \
--model_path /PATH/TO/LLAMA2/llama-2-7b-chat \
--precision W16A16
python demo.py --model_path lmsys/vicuna-7b-v1.5 --precision W16A16
```
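For reference, the vicuna_demo notebook in this commit drives the same quantized model directly from Python and swaps in TinyChat's fused kernels before generation. A minimal sketch is shown below; the paths are assumed to match the quantization step, and the calls are those used in the notebook:
```python
from awq.models.auto import AutoAWQForCausalLM
from transformers import AutoTokenizer
from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp

model_path = "vicuna-7b-v1.5-awq"      # directory written by the quantization step
quant_file = "awq_model_w4_g128.pt"    # default quantized weight file name

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoAWQForCausalLM.from_quantized(model_path, quant_file)

# Replace attention, layer norm and MLP with TinyChat's fused INT4 kernels
# (applied to the underlying HF model, i.e. model.model, as in the notebook).
make_quant_attn(model.model, "cuda:0")
make_quant_norm(model.model)
make_fused_mlp(model.model)
```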
## Reference
TinyChat is inspired by the following open-source projects: [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), [vLLM](https://github.com/vllm-project/vllm), [FastChat](https://github.com/lm-sys/FastChat).
MODEL_PATH=/data/llm/checkpoints/llama2-hf
MODEL_NAME=llama-2-7b-chat
# # Perform AWQ search and save search results (we already did it for you):
# mkdir -p awq_cache
# python -m awq.entry --model_path $MODEL_PATH/$MODEL_NAME \
# --w_bit 4 --q_group_size 128 \
# --run_awq --dump_awq awq_cache/llama-2-7b-chat-w4-g128.pt
# Generate real quantized weights (INT4):
mkdir -p quant_cache
python -m awq.entry --model_path $MODEL_PATH/$MODEL_NAME \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/llama-2-7b-chat-w4-g128.pt \
--q_backend real --dump_quant quant_cache/llama-2-7b-chat-w4-g128-awq.pt
# Run the TinyChat demo:
python demo.py --model_type llama \
--model_path $MODEL_PATH/$MODEL_NAME \
--q_group_size 128 --load_quant quant_cache/llama-2-7b-chat-w4-g128-awq.pt \
--precision W4A16