"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "312fdd775282fc16ef7e97f2d19ca63cdcae5424"
Unverified Commit 2a002d07 authored by Younes Belkada, committed by GitHub

[`Docs` / `Awq`] Add docs on exllamav2 + AWQ (#29474)

* add docs on exllamav2 + AWQ

* Update docs/source/en/quantization.md
@@ -196,6 +196,45 @@ The parameter `modules_to_fuse` should include:
</hfoption>
</hfoptions>

### Exllama-v2 support
Recent versions of `autoawq` support exllama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq` by running:
```bash
pip install git+https://github.com/casper-hansen/AutoAWQ.git
```
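If you want to confirm which build you ended up with, you can inspect the installed distribution (assuming the package installs under the name `autoawq`, as it does from the official repository):
```bash
pip show autoawq
```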
Then pass an `AwqConfig()` with `version="exllama"` when loading the model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Request the exllama-v2 kernels for the quantized layers
quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)

# Dummy forward pass to sanity-check the kernels
input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
output = model(input_ids)
print(output.logits)

# Generation works as usual
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
<Tip warning={true}>
Note that this feature is supported on AMD GPUs.
</Tip>
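For comparison against the stock kernels, here is a minimal sketch that loads the same checkpoint without exllama-v2 (this assumes `version="gemm"` selects the default AWQ GEMM kernels in your installed `transformers`/`autoawq` versions):
```python
from transformers import AutoModelForCausalLM, AwqConfig

# Assumption: "gemm" selects the default AWQ kernels instead of exllama-v2
quantization_config = AwqConfig(version="gemm")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)
```
Timing `model.generate` under both configurations is a quick way to measure the prefill and decoding speedup on your hardware.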
## AutoGPTQ
<Tip>
......