Commit 77ca8337 authored by Casper

Update install instructions

parent ff556eb0
@@ -4,7 +4,7 @@ AutoAWQ is a package that implements the Activation-aware Weight Quantization (A
Roadmap:
- [ ] Publish pip package
- [x] Publish pip package
- [ ] Refactor quantization code
- [ ] Support more models
- [ ] Optimize the speed of models
@@ -14,7 +14,18 @@ Roadmap:
Requirements:
- Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
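If you are unsure whether your GPU meets this requirement, a quick check with PyTorch (this assumes a CUDA build of `torch` is already installed; it is not part of the original instructions):

```python
import torch

# Compute capability of the current GPU, e.g. (8, 0) for A100, (8, 6) for RTX 3090/A6000.
major, minor = torch.cuda.get_device_capability()
print(f"sm{major}{minor}:", "supported" if (major, minor) >= (8, 0) else "below sm80, not supported")
```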
Clone this repository and install with pip.
Install:
- Use pip to install awq
```
pip install awq
```
### Build source
<details>
<summary>Build AutoAWQ from scratch</summary>
```
git clone https://github.com/casper-hansen/AutoAWQ
@@ -22,6 +33,8 @@ cd AutoAWQ
pip install -e .
```
</details>
## Supported models
The detailed support list:
@@ -36,6 +49,7 @@ The detailed support list:
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom | 560m/3B/7B |
| LLaVA-v0 | 13B |
| GPTJ | 6.7B |
## Usage
@@ -44,8 +58,8 @@ Below, you will find examples for how to easily quantize a model and run inferen
### Quantization
```python
from awq import AutoAWQForCausalLM              # pip-package import path (added in this commit)
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM  # older in-repo import path (removed in this commit)

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
```
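The diff cuts the quantization example off after the path definitions; the collapsed portion ends in `tokenizer.save_pretrained(quant_path)` (see the hunk header below). A minimal end-to-end sketch of the same flow, assuming the pip-package import path and a `quant_config` with the keys used elsewhere in the project (`zero_point`, `q_group_size`, `w_bit`); treat it as a sketch rather than the exact collapsed code:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
# Assumed quantization settings: 4-bit weights, group size 128, zero-point enabled.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ quantization, then save the quantized weights and tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```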
@@ -68,8 +82,8 @@ tokenizer.save_pretrained(quant_path)
Run inference on a quantized model from Huggingface:
```python
from awq import AutoAWQForCausalLM              # pip-package import path (added in this commit)
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM  # older in-repo import path (removed in this commit)

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"
```
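The loading and generation part of this example is likewise collapsed in the diff. A minimal sketch, assuming `from_quantized(path, filename)` loads the quantized checkpoint and that the wrapper's `generate` forwards to Transformers' `generate`; the prompt and `max_new_tokens` are illustrative:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load the 4-bit AWQ weights and the matching tokenizer from the Hub.
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

tokens = tokenizer("What is AWQ quantization?", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, streamer=streamer, max_new_tokens=256)
```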
@@ -101,8 +115,8 @@ Benchmark speeds may vary from server to server and that it also depends on your
| MPT-30B | A6000 | OOM | 31.57 | -- |
| Falcon-7B | A6000 | 39.44 | 27.34 | 1.44x |
For example, here is the difference between a fast and slow CPU on MPT-7B:

<details>
<summary>Detailed benchmark (CPU vs. GPU)</summary>

Here is the difference between a fast and slow CPU on MPT-7B:
RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
@@ -113,6 +130,8 @@ RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
</details>
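To get comparable numbers on your own hardware, a rough timing sketch (not the script behind the tables above; it reuses the model and file names from the inference example, and `max_new_tokens` is arbitrary):

```python
import time

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

tokens = tokenizer("Explain activation-aware weight quantization.", return_tensors="pt").input_ids.cuda()

# Time generation and report both units used above: tokens/s and ms/token.
start = time.time()
output = model.generate(tokens, max_new_tokens=128)
elapsed = time.time() - start

n_generated = output.shape[1] - tokens.shape[1]
print(f"{n_generated / elapsed:.1f} tokens/s ({1000 * elapsed / n_generated:.2f} ms/token)")
```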
## Reference
If you find AWQ useful or relevant to your research, you can cite their [paper](https://arxiv.org/abs/2306.00978):