AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created from, and improves upon, the [original work](https://github.com/mit-han-lab/llm-awq) from MIT.
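Quantizing a model with AutoAWQ follows a short load-quantize-save flow. The sketch below illustrates that flow with the `AutoAWQForCausalLM` class from the `awq` package; the model path, output directory, and quantization config values are illustrative assumptions rather than required settings.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"   # example FP16 model; substitute your own
quant_path = "vicuna-7b-v1.5-awq"     # example output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ quantization, then save the 4-bit weights and tokenizer
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```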
*Latest News* 🔥
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon). LLaMa 7B - up to 180 tokens/s.
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
- [2023/08] PyPi package released and AutoModel class available
...
...
Install the latest release from PyPI: `pip install autoawq`
<details>

<summary>Build AutoAWQ from scratch</summary>
Building from source can take about 10 minutes. Download your model while you install AutoAWQ.
Example: streamed inference with a pre-quantized model (the checkpoint path below is only an example; any AWQ-quantized model works).

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

# Any AWQ-quantized model from the Hub works here; this path is just an example
quant_path = "TheBloke/vicuna-7B-v1.5-AWQ"
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: {prompt}
ASSISTANT:"""

# Tokenize the formatted prompt and move it to the GPU
tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"),
    return_tensors='pt'
).input_ids.cuda()

# Generate output, streaming tokens as they are produced
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
</details>

## Benchmarks
Benchmark speeds may vary from server to server and also depend on your CPU. If you want to minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and high single-core CPU speed.