

Check out [TinyChat](tinychat), which delivers 2.3x faster inference performance for the **LLaMA-2** chatbot on RTX 4090!
It also offers a turn-key solution for **on-device inference** of LLMs on **resource-constrained edge platforms**. With TinyChat, it is now possible to run **large** models on **small** and **low-power** devices even without an Internet connection.
## News
- [2023/07] 🔥 We released **TinyChat**, an efficient and lightweight chatbot interface based on AWQ. TinyChat enables efficient LLM inference on both cloud and edge GPUs. Llama-2-chat models are supported! Check out our implementation [here](tinychat).
- [2023/07] 🔥 We added AWQ support and pre-computed search results for Llama-2 models (7B & 13B). Check out our model zoo [here](https://huggingface.co/datasets/mit-han-lab/awq-model-zoo)!
- [2023/07] We extended the support for more LLM models including MPT, Falcon, and BLOOM.
# TinyChat: Efficient and Lightweight Chatbot with AWQ
We introduce TinyChat, a cutting-edge chatbot interface designed for lightweight resource consumption and fast inference speed on GPU platforms. It allows for seamless deployment on consumer-level GPUs such as the RTX 3090/4090 and low-power edge devices like the NVIDIA Jetson Orin, empowering users with a responsive conversational experience like never before.
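To give a sense of why 4-bit weights translate into these speed and memory savings, below is a minimal PyTorch sketch of group-wise 4-bit weight quantization. It only illustrates the general W4A16 idea under our own assumptions (the function names, asymmetric quantization scheme, and group size of 128 are ours); it is not the AWQ search algorithm or the fused kernel that TinyChat actually ships.

```python
import torch

def quantize_w4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Illustrative group-wise 4-bit quantization (not TinyChat's kernel).
    One (scale, zero-point) pair is stored per group of `group_size` weights."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / 15.0   # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero  # a real kernel also packs two 4-bit values per byte

def dequantize_w4(q, scale, zero, shape):
    """Recover an FP16 approximation. A W4A16 kernel fuses this dequantization
    into the matrix multiply instead of materializing the FP16 weights."""
    return ((q.float() - zero) * scale).reshape(shape).half()

w = torch.randn(4096, 4096)          # stand-in for one transformer weight matrix
q, s, z = quantize_w4_groupwise(w)
w_hat = dequantize_w4(q, s, z, w.shape)
print((w - w_hat.float()).abs().mean())  # small per-weight quantization error
```

Since single-batch decoding is dominated by the time spent reading weights from GPU memory, shrinking the weights roughly 4x is what makes latency reductions of this kind possible.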
...
- [Examples](#examples)
- [Benchmarks](#benchmarks)
- [Usage](#usage)
- [Reference](#reference)
...


## Benchmarks
We benchmark TinyChat on the A6000 (server-class GPU), the 4090 (desktop GPU), and the Jetson Orin (edge GPU).
We use the default implementation from Hugging Face for the FP16 baseline. The INT4 implementation applies AWQ and utilizes our fast W4A16 GPU kernel. Note that the end-to-end runtime of INT4 TinyChat could be improved further by reducing the framework overhead from Hugging Face (e.g., by utilizing the implementation from TGI).
The latencies reported in all tables are per-token latencies for the generation stage.
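As a reference point, a per-token generation latency of this kind can be measured as in the sketch below, shown here for the FP16 Hugging Face baseline. The checkpoint name and token counts are placeholders, and this is our own simplified timing harness rather than TinyChat's benchmarking script; dividing the total generation time by the number of new tokens approximates per-token latency when the prompt is short.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

inputs = tok("Tell me about quantization.", return_tensors="pt").to("cuda")
n_new = 128

with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=8)   # warm-up pass
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    # Force exactly n_new tokens so the division below is meaningful.
    model.generate(**inputs, min_new_tokens=n_new, max_new_tokens=n_new,
                   do_sample=False)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0

print(f"~{dt / n_new * 1000:.2f} ms per generated token (FP16 baseline)")
```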