Commit 98f6d7b9 authored by Casper Hansen

Guide on quantized vs non-quantized

parent c88b2e25
...@@ -72,10 +72,22 @@ The detailed support list:
Under examples, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models.
### INT4 GEMM vs INT4 GEMV vs FP16
There are two versions of AWQ: GEMM and GEMV. Both names refer to how matrix multiplication runs under the hood. We suggest the following (a kernel-selection sketch follows the list):
- GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
- GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
- FP16 (non-quantized): Best for batch sizes of 8 or larger, highest throughput. We recommend [TGI](https://github.com/huggingface/text-generation-inference) or [vLLM](https://github.com/vllm-project/vllm).
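
The kernel version is chosen at quantization time through the `version` key of the quantization config. The sketch below assumes that key accepts `"GEMM"` or `"GEMV"`; the model id and output path are placeholders.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # placeholder model id
quant_path = "mistral-7b-awq-gemm"        # placeholder output directory

# "version" selects the packed kernel: "GEMM" (larger context, batch > 1)
# or "GEMV" (small context, batch size 1)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize with the selected kernel version, then save the packed weights
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The saved checkpoint is packed for the kernel chosen here, so later inference runs with that same GEMM or GEMV layout.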
### Examples
<details>
<summary>Quantization</summary>
Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
...