".github/vscode:/vscode.git/clone" did not exist on "71e4268600147a3f1ba1d9f9817ea369ee7493c8"
    Add support for Marlin-quantized models · 4594e6fa
    Daniël de Kok authored
    This change adds support for Marlin-quantized models. Marlin is an
    FP16xINT4 matmul kernel that provides good speedups when decoding
    batches of 16-32 tokens. It supports models quantized with symmetric
    quantization, a group size of -1 or 128, and 4-bit weights.
    
    Tested with:
    
    - Llama 2
    - Llama 3
    - Phi 3
marlin.py 2.42 KB
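
Marlin itself is a fused CUDA kernel, but the arithmetic it implements can be
written out in a few lines of plain PyTorch: 4-bit symmetric (zero-point-free)
integer weights are rescaled by per-group FP16 scales and multiplied with FP16
activations. The sketch below is only a reference for that math, under assumed
shapes and with illustrative names (dequantize_int4, fp16_int4_matmul); it is
not the Marlin kernel and not the contents of marlin.py.

import torch


def dequantize_int4(q: torch.Tensor, scales: torch.Tensor, groupsize: int) -> torch.Tensor:
    """Dequantize symmetric 4-bit weights to FP16.

    q:      (K, N) integer weights in [-8, 7]; symmetric quantization has no zero point.
    scales: (K // groupsize, N) FP16 scales; groupsize == -1 means one group spanning all of K.
    """
    k, _ = q.shape
    if groupsize == -1:
        groupsize = k
    # Broadcast each group's scale over its `groupsize` rows, then rescale the integers.
    expanded = scales.repeat_interleave(groupsize, dim=0)  # (K, N)
    return q.to(torch.float16) * expanded


def fp16_int4_matmul(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor,
                     groupsize: int = 128) -> torch.Tensor:
    """Reference for the FP16xINT4 matmul that Marlin fuses into a single kernel.

    The float32 upcast is only so this reference runs on CPU; the real kernel keeps
    activations in FP16 and weights packed as INT4 on the GPU.
    """
    w = dequantize_int4(q, scales, groupsize)
    return (x.float() @ w.float()).to(torch.float16)


if __name__ == "__main__":
    k, n, batch = 4096, 4096, 16  # decode-sized batch, where Marlin gives its best speedups
    x = torch.randn(batch, k, dtype=torch.float16)
    q = torch.randint(-8, 8, (k, n), dtype=torch.int8)            # stand-in for packed INT4 weights
    scales = torch.rand(k // 128, n, dtype=torch.float16) * 0.01  # per-group scales
    print(fp16_int4_matmul(x, q, scales, groupsize=128).shape)    # torch.Size([16, 4096])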