server/text_generation_server/layers/tensor_parallel.py · 36dd16017c7211b7760d1daa188172bb902e486f · OpenDAS / text-generation-inference

Add support for exl2 quantization · 36dd1601

Daniël de Kok authored May 28, 2024

Mostly straightforward. Changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing
  around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the
  maximum input sequence length to avoid allocating huge
  scratch buffers that OOM.
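
The two changes above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: `QuantParams`, `ScratchPlanner`, their fields, and the sizing formula are all hypothetical names and assumptions chosen to show the idea of a typed parameter wrapper and of deferring scratch sizing until the maximum input length is known.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QuantParams:
    """Hypothetical typed wrapper replacing an untyped tuple of quantizer parameters."""

    q_weight: list
    q_scale: list
    bits: int
    group_size: Optional[int] = None


class ScratchPlanner:
    """Hypothetical helper that defers scratch buffer sizing until warmup."""

    def __init__(self) -> None:
        # Output feature counts of the layers registered at load time.
        self._out_features: List[int] = []
        # Unknown until warmup() runs with the real maximum input length.
        self.scratch_bytes: Optional[int] = None

    def register(self, out_features: int) -> None:
        # Record the layer's width; no buffer is sized at this point.
        self._out_features.append(out_features)

    def warmup(self, max_input_tokens: int, bytes_per_element: int = 2) -> int:
        # Assumed formula: size for the widest layer at the maximum input
        # length, instead of guessing a worst case when the model is loaded.
        widest = max(self._out_features, default=0)
        self.scratch_bytes = max_input_tokens * widest * bytes_per_element
        return self.scratch_bytes
```

Callers would read fields by name (`params.bits`) instead of unpacking a tuple by position, and nothing is sized until `warmup` is called with the actual maximum input sequence length.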

tensor_parallel.py 8.55 KB