Commits · 142cdabed377772b763fc8d79a131b16ed991718 · OpenDAS / text-generation-inference

12 Feb, 2024 1 commit
- feat: experimental support for cuda graphs (#1428) · 0d794af6
  OlivierDehaene authored Feb 12, 2024
```
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
```
25 Sep, 2023 1 commit

Add AWQ quantization inference support (#1019) (#1054) · c5de7cd8

Nicolas Patry authored Sep 25, 2023

# Add AWQ quantization inference support

Fixes
https://github.com/huggingface/text-generation-inference/issues/781

This PR (partially) adds support for AWQ quantization for inference.
More information on AWQ is available [here](https://arxiv.org/abs/2306.00978). In
general, AWQ is faster and more accurate than GPTQ, which TGI already
supports.

This PR installs the 4-bit GEMM custom CUDA kernels released by the AWQ authors
(a one-line change in `requirements.txt`).

A quick way to test this PR is to bring up TGI as follows:

```
text-generation-server download-weights abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq

text-generation-launcher \
--huggingface-hub-cache ~/.cache/huggingface/hub/ \
--model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \
--trust-remote-code --port 8080 \
--max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
--quantize awq
```
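
Once the launcher is up, one way to check that the quantized model actually answers is to send a prompt to TGI's `/generate` endpoint. The snippet below is a minimal sketch, not part of this PR: it assumes the server is reachable on `127.0.0.1:8080` (matching `--port 8080` above), and the helper names `build_generate_request` and `generate` are hypothetical, chosen only for illustration.

```python
import json
import urllib.request

# Assumption: the launcher started above is listening on localhost, port 8080.
TGI_URL = "http://127.0.0.1:8080/generate"


def build_generate_request(prompt, max_new_tokens=64, url=TGI_URL):
    """Build a POST request for the /generate endpoint (nothing is sent yet)."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=64, url=TGI_URL, timeout=60):
    """Send the prompt to a running server and return the generated text."""
    request = build_generate_request(prompt, max_new_tokens, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]
```

With the server running, calling `generate("def fibonacci(n):")` should return the model's completion as a string. Keeping request construction separate from sending makes the payload easy to inspect without a live server.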

Please note:
* Th...
