

Check out [TinyChat](tinychat), which delivers 2.3x faster inference performance for the **LLaMA-2** chatbot on RTX 4090!
It also offers a turn-key solution for **on-device inference** of LLMs on **resource-constrained edge platforms**. With TinyChat, it is now possible to run **large** models on **small** and **low-power** devices even without an Internet connection.
## News
- [2023/07] 🔥 We released **TinyChat**, an efficient and lightweight chatbot interface based on AWQ. TinyChat enables efficient LLM inference on both cloud and edge GPUs. Llama-2-chat models are supported! Check out our implementation [here](tinychat).
- [2023/07] 🔥 We added AWQ support and pre-computed search results for Llama-2 models (7B & 13B). Check out our model zoo [here](https://huggingface.co/datasets/mit-han-lab/awq-model-zoo)!
- [2023/07] We extended the support for more LLM models including MPT, Falcon, and BLOOM.
# TinyChat: Efficient and Lightweight Chatbot with AWQ
We introduce TinyChat, a cutting-edge chatbot interface designed for lightweight resource consumption and fast inference speed on GPU platforms. It allows for seamless deployment on consumer-level GPUs such as the RTX 3090/4090 and low-power edge devices like the NVIDIA Jetson Orin, empowering users with a responsive conversational experience like never before.
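To give a sense of why 4-bit weights translate into these speed and memory savings, below is a minimal PyTorch sketch of group-wise 4-bit weight quantization. It only illustrates the general W4A16 idea under our own assumptions (the function names, asymmetric quantization scheme, and group size of 128 are ours); it is not the AWQ search algorithm or the fused kernel that TinyChat actually ships.

```python
import torch

def quantize_w4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Illustrative group-wise 4-bit quantization (not TinyChat's kernel).
    One (scale, zero-point) pair is stored per group of `group_size` weights."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / 15.0   # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero  # a real kernel also packs two 4-bit values per byte

def dequantize_w4(q, scale, zero, shape):
    """Recover an FP16 approximation. A W4A16 kernel fuses this dequantization
    into the matrix multiply instead of materializing the FP16 weights."""
    return ((q.float() - zero) * scale).reshape(shape).half()

w = torch.randn(4096, 4096)          # stand-in for one transformer weight matrix
q, s, z = quantize_w4_groupwise(w)
w_hat = dequantize_w4(q, s, z, w.shape)
print((w - w_hat.float()).abs().mean())  # small per-weight quantization error
```

Since single-batch decoding is dominated by the time spent reading weights from GPU memory, shrinking the weights roughly 4x is what makes latency reductions of this kind possible.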
...
- [Examples](#examples)
- [Benchmarks](#benchmarks)
- [Usage](#usage)
- [Reference](#reference)
...


## Benchmarks
We benchmark TinyChat on the A6000 (server-class GPU), the 4090 (desktop GPU), and the Jetson Orin (edge GPU).
We use the default implementation from Hugging Face for the FP16 baseline. The INT4 implementation applies AWQ and utilizes our fast W4A16 GPU kernel. Note that the end-to-end runtime of INT4 TinyChat could be improved further by reducing the framework overhead from Hugging Face (e.g., by utilizing the implementation from TGI).
The latencies reported in all tables are per-token latencies for the generation stage.
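As a reference point, a per-token generation latency of this kind can be measured as in the sketch below, shown here for the FP16 Hugging Face baseline. The checkpoint name and token counts are placeholders, and this is our own simplified timing harness rather than TinyChat's benchmarking script; dividing the total generation time by the number of new tokens approximates per-token latency when the prompt is short.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

inputs = tok("Tell me about quantization.", return_tensors="pt").to("cuda")
n_new = 128

with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=8)   # warm-up pass
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    # Force exactly n_new tokens so the division below is meaningful.
    model.generate(**inputs, min_new_tokens=n_new, max_new_tokens=n_new,
                   do_sample=False)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0

print(f"~{dt / n_new * 1000:.2f} ms per generated token (FP16 baseline)")
```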