OpenDAS / AutoAWQ / Commits

Commit 77ca8337, authored Aug 27, 2023 by Casper
Update install instructions
Parent: ff556eb0

Showing 1 changed file with 24 additions and 5 deletions.

README.md (+24, -5)
...
@@ -4,7 +4,7 @@ AutoAWQ is a package that implements the Activation-aware Weight Quantization (A
 Roadmap:
-- [ ] Publish pip package
+- [x] Publish pip package
 - [ ] Refactor quantization code
 - [ ] Support more models
 - [ ] Optimize the speed of models
...
@@ -14,7 +14,18 @@ Roadmap:
 Requirements:
 - Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
 
-Clone this repository and install with pip.
+Install:
+- Use pip to install awq
+
+```
+pip install awq
+```
+
+### Build source
+
+<details>
+<summary>Build AutoAWQ from scratch</summary>
+
 ```
 git clone https://github.com/casper-hansen/AutoAWQ
...
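Worth noting alongside the new pip-install path: the sm80 requirement stated in the hunk above can be checked before installing. A minimal sketch (not part of the commit), using PyTorch's standard device-capability query:

```python
import torch

# Compute Capability as a (major, minor) tuple, e.g. (8, 0) for A100,
# (8, 9) for RTX 4090. Requires a CUDA build of PyTorch and a visible GPU.
major, minor = torch.cuda.get_device_capability()

if (major, minor) < (8, 0):
    raise RuntimeError(
        f"AutoAWQ kernels need Compute Capability >= 8.0 (sm80); "
        f"this GPU reports sm{major}{minor}."
    )
print(f"sm{major}{minor} detected; OK to `pip install awq`.")
```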
@@ -22,6 +33,8 @@ cd AutoAWQ
 pip install -e .
 ```
+
+</details>
 
 ## Supported models
 
 The detailed support list:
...
@@ -36,6 +49,7 @@ The detailed support list:
 | OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
 | Bloom | 560m/3B/7B/ |
 | LLaVA-v0 | 13B |
+| GPTJ | 6.7B |
 
 ## Usage
...
@@ -44,8 +58,8 @@ Below, you will find examples for how to easily quantize a model and run inference
 ### Quantization
 
 ```python
-from awq import AutoAWQForCausalLM
 from transformers import AutoTokenizer
+from awq.models.auto import AutoAWQForCausalLM
 
 model_path = 'lmsys/vicuna-7b-v1.5'
 quant_path = 'vicuna-7b-v1.5-awq'
...
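The hunk above only changes the import line; the quantization calls themselves sit in the elided context (the next hunk's trailer shows `tokenizer.save_pretrained(quant_path)` as its anchor). For orientation, a hedged sketch of the surrounding flow; the `quant_config` values and method names are assumptions consistent with the `w4_g128` naming used below, not lines from this commit:

```python
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
# Assumed settings: 4-bit weights, group size 128 (cf. "awq_model_w4_g128.pt").
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize, then persist both artifacts.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```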
@@ -68,8 +82,8 @@ tokenizer.save_pretrained(quant_path)
 Run inference on a quantized model from Huggingface:
 
 ```python
-from awq import AutoAWQForCausalLM
 from transformers import AutoTokenizer
+from awq.models.auto import AutoAWQForCausalLM
 
 quant_path = "casperhansen/vicuna-7b-v1.5-awq"
 quant_file = "awq_model_w4_g128.pt"
...
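As above, only the import moves; the load-and-generate calls are elided. A sketch of the likely remainder, where `from_quantized`'s signature is an assumption about the API of this period rather than part of the diff:

```python
from transformers import AutoTokenizer
from awq.models.auto import AutoAWQForCausalLM

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load the pre-quantized checkpoint and its tokenizer from the Hub.
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# Plain generation on GPU.
tokens = tokenizer("What is AWQ?", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```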
@@ -101,8 +115,11 @@ Benchmark speeds may vary from server to server and that it also depends on your
 | MPT-30B | A6000 | OOM | 31.57 | -- |
 | Falcon-7B | A6000 | 39.44 | 27.34 | 1.44x |
 
-For example, here is the difference between a fast and slow CPU on MPT-7B:
+<details>
+<summary>Detailed benchmark (CPU vs. GPU)</summary>
+
+Here is the difference between a fast and slow CPU on MPT-7B:
 
 RTX 4090 + Intel i9 13900K (2 different VMs):
 - CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
...
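The two figures quoted for each configuration are reciprocals, which makes the entries easy to sanity-check:

```python
# tokens/s and ms/token are reciprocal quantities:
tokens_per_s = 134
print(f"{1000 / tokens_per_s:.2f} ms/token")  # 7.46, matching the i9 13900K entry
```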
@@ -113,6 +130,8 @@ RTX 4090 + AMD EPYC 7-Series (3 different VMs):
 - CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
 - CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
+
+</details>
 
 ## Reference
 
 If you find AWQ useful or relevant to your research, you can cite their [paper](https://arxiv.org/abs/2306.00978):
...