    Add support for Marlin-quantized models · 4594e6fa
    Daniël de Kok authored
    This change adds support for Marlin-quantized models. Marlin is an
    FP16xINT4 matmul kernel that provides good speedups when decoding
    batches of 16-32 tokens. It supports models with symmetric 4-bit
    quantization and a group size of -1 or 128.
    
    Tested with:
    
    - Llama 2
    - Llama 3
    - Phi 3
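
    Assuming this commit targets the text-generation-inference launcher
    (the changed file is launcher.md), serving one of the models above
    with a Marlin checkpoint might look like the sketch below. The model
    ID is a placeholder, and the `--quantize marlin` value is inferred
    from this change rather than confirmed by it:

    ```shell
    # Hypothetical invocation: <model-id> must point to a checkpoint that
    # was quantized for Marlin (symmetric 4-bit, group size -1 or 128).
    text-generation-launcher \
        --model-id <model-id> \
        --quantize marlin
    ```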