".github/vscode:/vscode.git/clone" did not exist on "71e4268600147a3f1ba1d9f9817ea369ee7493c8"
    Add support for Marlin-quantized models · 4594e6fa
    Daniël de Kok authored
    This change adds support for Marlin-quantized models. Marlin is an
    FP16xINT4 matmul kernel that provides good speedups when decoding
    batches of 16-32 tokens. It supports models quantized with symmetric
    quantization, a group size of -1 or 128, and 4-bit weights.
    
    Tested with:
    
    - Llama 2
    - Llama 3
    - Phi 3
marlin.py 2.42 KB
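
Marlin itself is a fused CUDA kernel, but the arithmetic it implements can be
written out in a few lines of plain PyTorch: 4-bit symmetric (zero-point-free)
integer weights are rescaled by per-group FP16 scales and multiplied with FP16
activations. The sketch below is only a reference for that math, under assumed
shapes and with illustrative names (dequantize_int4, fp16_int4_matmul); it is
not the Marlin kernel and not the contents of marlin.py.

import torch


def dequantize_int4(q: torch.Tensor, scales: torch.Tensor, groupsize: int) -> torch.Tensor:
    """Dequantize symmetric 4-bit weights to FP16.

    q:      (K, N) integer weights in [-8, 7]; symmetric quantization has no zero point.
    scales: (K // groupsize, N) FP16 scales; groupsize == -1 means one group spanning all of K.
    """
    k, _ = q.shape
    if groupsize == -1:
        groupsize = k
    # Broadcast each group's scale over its `groupsize` rows, then rescale the integers.
    expanded = scales.repeat_interleave(groupsize, dim=0)  # (K, N)
    return q.to(torch.float16) * expanded


def fp16_int4_matmul(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor,
                     groupsize: int = 128) -> torch.Tensor:
    """Reference for the FP16xINT4 matmul that Marlin fuses into a single kernel.

    The float32 upcast is only so this reference runs on CPU; the real kernel keeps
    activations in FP16 and weights packed as INT4 on the GPU.
    """
    w = dequantize_int4(q, scales, groupsize)
    return (x.float() @ w.float()).to(torch.float16)


if __name__ == "__main__":
    k, n, batch = 4096, 4096, 16  # decode-sized batch, where Marlin gives its best speedups
    x = torch.randn(batch, k, dtype=torch.float16)
    q = torch.randint(-8, 8, (k, n), dtype=torch.int8)            # stand-in for packed INT4 weights
    scales = torch.rand(k // 128, n, dtype=torch.float16) * 0.01  # per-group scales
    print(fp16_int4_matmul(x, q, scales, groupsize=128).shape)    # torch.Size([16, 4096])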