Commit 98f6d7b9 authored by Casper Hansen

Guide on quantized vs non-quantized

parent c88b2e25
@@ -72,10 +72,22 @@ The detailed support list:
Under the examples directory, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models.
### INT4 GEMM vs INT4 GEMV vs FP16
There are two versions of AWQ: GEMM and GEMV. Both names refer to how the matrix multiplication runs under the hood. We suggest the following (a sketch of how to select the kernel at quantization time follows this list):
- GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
- GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
- FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend [TGI](https://github.com/huggingface/text-generation-inference) or [vLLM](https://github.com/vllm-project/vllm).
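Below is a minimal sketch of how the GEMM/GEMV choice typically enters the quantization config via the `version` field. The model path, output path, and the other config values are placeholder assumptions, not a prescription; see the full example under Examples for settings that match your model.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths -- substitute your own model and output directory.
model_path = "facebook/opt-125m"
quant_path = "opt-125m-awq"

# "version" selects the kernel: "GEMM" for larger context and batch sizes up to 8,
# "GEMV" for small context at batch size 1. The remaining keys are common defaults.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer, quantize, then save the quantized weights.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

For the FP16 (non-quantized) path, there is nothing to quantize: serve the original checkpoint directly with TGI or vLLM.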
### Examples
<details>
<summary>Quantization</summary>
Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
......