Commits · 583d37a2f8aee624aa1b77dfc359e32469205c08 · OpenDAS / text-generation-inference


23 Jul, 2024 1 commit

Add support for repacking AWQ weights for GPTQ-Marlin (#2278) · 9935720c

Daniël de Kok authored Jul 23, 2024

* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
has recently added support for AWQ repacking and AWQ asymmetric
quantization (zero_point=True).
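
To make the distinction concrete, here is a minimal pure-Python sketch of 4-bit quantization with and without a stored zero point. It is a generic illustration, not code from TGI or the Marlin kernels; all function names are hypothetical, and the symmetric variant assumes the zero point is fixed at mid-range.

```python
def quantize_asymmetric(values, bits=4):
    """Quantize with a scale and a stored zero point (zero_point=True style)."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant group
    zero = round(-lo / scale)
    q = [min(qmax, max(0, round(v / scale) + zero)) for v in values]
    return q, scale, zero


def dequantize_asymmetric(q, scale, zero):
    return [(x - zero) * scale for x in q]


def quantize_symmetric(values, bits=4):
    """Quantize with a scale only; the zero point is assumed fixed at mid-range."""
    qmax = 2**bits - 1
    zero = 2 ** (bits - 1)  # implicit, not stored with the weights
    absmax = max(abs(v) for v in values)
    scale = absmax / (qmax - zero) or 1.0
    q = [min(qmax, max(0, round(v / scale) + zero)) for v in values]
    return q, scale


def dequantize_symmetric(q, scale, bits=4):
    zero = 2 ** (bits - 1)
    return [(x - zero) * scale for x in q]


# A weight group whose values are all non-negative.
group = [0.0, 0.5, 1.0, 1.5]
q_asym, scale_asym, zero_asym = quantize_asymmetric(group)
q_sym, scale_sym = quantize_symmetric(group)
```

For a skewed group like this one, the asymmetric scheme spreads the values over all 16 levels, while the symmetric scheme only uses the upper half of the range. A kernel that assumes a fixed zero point cannot dequantize weights stored with a per-group zero point, which is why zero-point support is needed.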

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.


14 Jun, 2024 1 commit

Add support for GPTQ Marlin (#2052) · 093a27c5

Daniël de Kok authored Jun 14, 2024

Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false
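
The list above can be read as a compatibility check. The following is a minimal sketch of such a check, assuming a hypothetical `GPTQConfig` container and function name; it is not the check TGI performs, which likely involves further conditions not listed in this message.

```python
from dataclasses import dataclass

# Values taken from the list in the commit message.
SUPPORTED_BITS = (4, 8)
SUPPORTED_GROUP_SIZES = (-1, 32, 64, 128)


@dataclass(frozen=True)
class GPTQConfig:
    """Hypothetical container for the three parameters named above."""
    bits: int
    groupsize: int
    desc_act: bool


def is_gptq_marlin_compatible(config: GPTQConfig) -> bool:
    """Return True if the configuration falls within the listed ranges.

    desc_act is accepted with either value, so it does not affect the result.
    """
    return (
        config.bits in SUPPORTED_BITS
        and config.groupsize in SUPPORTED_GROUP_SIZES
    )
```

For example, `is_gptq_marlin_compatible(GPTQConfig(bits=4, groupsize=128, desc_act=True))` is true, while a 3-bit configuration or a group size of 16 is rejected.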

Using the GPTQ Marlin kernels requires repacking the parameters into the
Marlin quantizer format.
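
As a rough illustration of what repacking means, the sketch below unpacks eight 4-bit values from a 32-bit word in one slot order and packs them again in another. The slot orders are placeholders chosen for the example, not the actual GPTQ or Marlin layouts, and the function names are hypothetical.

```python
def pack_nibbles(values, order):
    """Pack eight 4-bit integers into one 32-bit word.

    order[slot] is the index of the value stored in nibble `slot`,
    where slot 0 occupies the lowest four bits.
    """
    if len(values) != 8 or sorted(order) != list(range(8)):
        raise ValueError("expected eight values and a permutation of 0..7")
    word = 0
    for slot, src in enumerate(order):
        word |= (values[src] & 0xF) << (4 * slot)
    return word


def unpack_nibbles(word, order):
    """Inverse of pack_nibbles for the same slot order."""
    values = [0] * 8
    for slot, src in enumerate(order):
        values[src] = (word >> (4 * slot)) & 0xF
    return values


def repack(word, src_order, dst_order):
    """Convert a packed word from one slot order to another."""
    return pack_nibbles(unpack_nibbles(word, src_order), dst_order)


# Placeholder layouts for illustration only.
SEQUENTIAL = [0, 1, 2, 3, 4, 5, 6, 7]
INTERLEAVED = [0, 2, 4, 6, 1, 3, 5, 7]
```

The real kernels operate on whole weight tensors on the GPU and use a layout tuned for their memory access pattern; the sketch only shows the unpack-then-pack idea on a single word.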

The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
