"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "312fdd775282fc16ef7e97f2d19ca63cdcae5424"
Unverified Commit 2a002d07 authored by Younes Belkada, committed by GitHub

[`Docs` / `Awq`] Add docs on exllamav2 + AWQ (#29474)

* add docs on exllamav2 + AWQ

* Update docs/source/en/quantization.md
@@ -196,6 +196,45 @@ The parameter `modules_to_fuse` should include:
</hfoption>
</hfoptions>

### Exllama-v2 support
Recent versions of `autoawq` support exllama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq` by running:
```bash
pip install git+https://github.com/casper-hansen/AutoAWQ.git
```
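If you want to confirm which build you ended up with, you can inspect the installed distribution (assuming the package installs under the name `autoawq`, as it does from the official repository):
```bash
pip show autoawq
```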
Then pass an `AwqConfig()` with `version="exllama"` when loading the model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Request the exllama-v2 kernels for the quantized layers
quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)

# Dummy forward pass to sanity-check the kernels
input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
output = model(input_ids)
print(output.logits)

# Generation works as usual
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
<Tip warning={true}>
Note that this feature is supported on AMD GPUs.
</Tip>
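For comparison against the stock kernels, here is a minimal sketch that loads the same checkpoint without exllama-v2 (this assumes `version="gemm"` selects the default AWQ GEMM kernels in your installed `transformers`/`autoawq` versions):
```python
from transformers import AutoModelForCausalLM, AwqConfig

# Assumption: "gemm" selects the default AWQ kernels instead of exllama-v2
quantization_config = AwqConfig(version="gemm")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)
```
Timing `model.generate` under both configurations is a quick way to measure the prefill and decoding speedup on your hardware.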
## AutoGPTQ
<Tip>
......