server/text_generation_server/layers/linear.py · 36dd16017c7211b7760d1daa188172bb902e486f · OpenDAS / text-generation-inference

Add support for exl2 quantization · 36dd1601

Daniël de Kok authored May 28, 2024

Mostly straightforward. Changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing
  around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the
  maximum input sequence length to avoid allocating huge
  scratch buffers that OOM.
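
The two changes above can be illustrated with a minimal, dependency-free sketch. It is not the code from this commit: `QuantWeightSketch`, `ScratchPlanner`, the field names, and the per-token sizing rule are all hypothetical, and a `bytearray` stands in for a device buffer. The idea is that layers only record their scratch requirement at load time, and the single allocation happens at warmup once the maximum input length is known.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class QuantWeightSketch:
    """Hypothetical typed wrapper used instead of an untyped tuple of quantizer parameters."""

    q_weight: Any  # packed quantized weights (placeholder type)
    q_scale: Any   # quantization scales (placeholder type)
    bits: int      # bit width, an assumed field for illustration


class ScratchPlanner:
    """Hypothetical helper: collect scratch requirements, allocate once at warmup."""

    def __init__(self) -> None:
        self._per_token_bytes = 0
        self.buffer: Optional[bytearray] = None

    def register_layer(self, per_token_bytes: int) -> None:
        # Layers share one buffer, so only the largest requirement matters.
        self._per_token_bytes = max(self._per_token_bytes, per_token_bytes)

    def warmup(self, max_input_tokens: int) -> int:
        # Size the buffer from the real maximum input length rather than
        # a pessimistic upper bound chosen at load time.
        size = self._per_token_bytes * max_input_tokens
        self.buffer = bytearray(size)
        return size


weight = QuantWeightSketch(q_weight=[1, 2], q_scale=[0.5], bits=4)
planner = ScratchPlanner()
planner.register_layer(16)
planner.register_layer(64)
allocated = planner.warmup(max_input_tokens=128)
```

Accessing `weight.bits` by name is what the wrapper buys over positional tuple unpacking, and deferring allocation to `warmup` keeps the buffer proportional to the configured maximum input length.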

linear.py 7.37 KB