"Please get the Vicuna model from [FastChat](https://github.com/lm-sys/FastChat) and run the following command to generate a quantized model checkpoint first.\n",
"[Warning] Calling a fake MLP fusion. But still faster than Huggingface Implimentation.\n"
]
},
{
"data": {
"text/plain": [
...
")"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"make_quant_attn(model.model, \"cuda:0\")\n",
"make_quant_norm(model.model)\n",
"make_fused_mlp(model.model)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"USER: Show me some attractions in Boston.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ASSISTANT: Sure! Here are some popular tourist attractions in Boston:\n",
"\n",
"1. Freedom Trail - a 2.5-mile walking trail that takes you through some of the most important historical sites in Boston, including Paul Revere's House, the Old North Church, and the site of the Boston Massacre.\n",
"2. Fenway Park - home to the Boston Red Sox baseball team, this historic ballpark is one of the oldest in Major League Baseball.\n",
"3. Museum of Fine Arts - one of the largest art museums in the country, with a collection of over 450,000 works of art from around the world.\n",
"4. Boston Harbor Islands National Recreation Area - a group of islands located just offshore from downtown Boston that offer stunning views of the city skyline and easy access to outdoor recreational activities like hiking and kayaking.\n",
"5. New England Aquarium - one of the oldest and largest aquariums in the United States, featuring a wide variety of marine life, including giant whales and colorful fish.\n",
"6. The USS Constitution Museum - located on board the USS Constitution, a historic ship that played a key role in the War of 1812 and is still in active service today.\n",
"7. Bunker Hill Monument - a 221-foot-tall obelisk located in Charlestown that commemorates the Battle of Bunker Hill during the Revolutionary War.\n",
"8. The Hancock Building - a historic building in the heart of Boston that offers panoramic views of the city from its observation deck.\n",
...

We introduce TinyChat, a cutting-edge chatbot interface designed for lightweight resource consumption and fast inference on GPU platforms.
The current release supports:
- LLaMA-2-7B/13B-chat;
- Vicuna;
- MPT-chat;
- Falcon-instruct.
## Contents

- [Examples](#examples)
- [Benchmarks](#benchmarks)
- [Usage](#usage)
- [Reference](#reference)
...

The latency reported in all tables is the per-token latency for the generation stage.

...
2. Download the pretrained instruction-tuned LLMs:
- For LLaMA-2-chat, please refer to [this link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf);
- For Vicuna, please refer to [this link](https://huggingface.co/lmsys/);
- For MPT-chat, please refer to [this link](https://huggingface.co/mosaicml/mpt-7b-chat);
- For Falcon-instruct, please refer to [this link](https://huggingface.co/tiiuae/falcon-7b-instruct).
3. Quantize instruction-tuned LLMs with AWQ (see [usage in README](../README.md#usage)):
- We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:
```bash
git lfs install  # install git lfs if not already
```
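A minimal sketch of fetching the pre-computed search results, assuming they are published as the `mit-han-lab/awq-model-zoo` dataset repository on the Hugging Face Hub and cloned into a local `awq_cache` directory (both the repository name and the directory are assumptions, not given in this excerpt):

```bash
# Assumption: the pre-computed AWQ search results are hosted as a Hugging Face dataset repo.
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
```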
- Alternatively, you may go through the process step by step. We demonstrate the quantization process with LLaMA-2; for all other models except Falcon, you only need to change the `model_path` and the save locations. For Falcon-7B, also change `q_group_size` from 128 to 64.
- Perform AWQ search and save search results (we already did it for you):
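As a hedged illustration only (the `awq.entry` module name, the flag names, and the file paths below follow the AWQ project's usual CLI pattern and are assumptions, not taken from this excerpt), the search step could look like:

```bash
# Run AWQ search on LLaMA-2-7B-chat and save the resulting scales/clips.
# All paths are placeholders; the --run_awq/--dump_awq flag names are assumptions.
python -m awq.entry --model_path /PATH/TO/Llama-2-7b-chat-hf \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/llama-2-7b-chat-w4-g128.pt
```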
You may also run the following command to execute the chatbot in FP16 to compare the speed and quality of language generation:
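A minimal sketch of what such a command might look like, assuming TinyChat exposes a `demo.py` entry point with `--model_type`, `--model_path`, and `--precision` options (the script name and flags are assumptions, not shown in this excerpt):

```bash
# Run the TinyChat demo in FP16 to compare speed and quality against the quantized version.
# demo.py and its flags are assumptions; replace the path with your local checkpoint.
python demo.py --model_type llama --model_path /PATH/TO/Llama-2-7b-chat-hf --precision fp16
```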
TinyChat is inspired by the following open-source projects: [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), [vLLM](https://github.com/vllm-project/vllm), [FastChat](https://github.com/lm-sys/FastChat).