    Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2
    Daniël de Kok authored
    * Improve support for GPUs with capability < 8
    
    - For models that cannot use flashinfer, use flash-attn v1 + paged
      attention on GPUs with a compute capability below 8.
    - Disable prefix caching when using paged attention.
    - When using flash-attn v1, pass the key/value, rather than the
      cache, since v1 cannot use block tables.
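
The three rules above can be sketched as a small selection function. This is only an illustration of the decision logic described in this message, not the code from the change: `AttentionPlan`, `choose_attention`, and the backend strings are hypothetical names, and the branch for models that cannot use flashinfer on capability 8 or newer is an assumption, since the message does not describe it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionPlan:
    """Hypothetical summary of how attention would be configured."""
    backend: str          # illustrative label, not a real configuration value
    prefix_caching: bool  # whether prefix caching stays enabled
    pass_raw_kv: bool     # True: pass key/value tensors; False: pass the KV cache


def choose_attention(capability_major: int, can_use_flashinfer: bool) -> AttentionPlan:
    """Pick an attention setup from the GPU capability and model support."""
    if can_use_flashinfer:
        # Assumption: flashinfer keeps prefix caching and reads from the cache.
        return AttentionPlan("flashinfer", prefix_caching=True, pass_raw_kv=False)

    if capability_major < 8:
        # Older GPUs: flash-attn v1 + paged attention. v1 cannot use block
        # tables, so it receives the key/value tensors instead of the cache.
        return AttentionPlan("flash-attn-v1+paged", prefix_caching=False, pass_raw_kv=True)

    # Assumption: newer GPUs without flashinfer still use paged attention,
    # so prefix caching is disabled here as well.
    return AttentionPlan("paged", prefix_caching=False, pass_raw_kv=False)


if __name__ == "__main__":
    print(choose_attention(7, can_use_flashinfer=False))
    print(choose_attention(8, can_use_flashinfer=True))
```

In this sketch prefix caching is turned off in every branch that uses paged attention, which mirrors the second rule; only the flash-attn v1 branch passes raw key/value tensors.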
    
    * nix: add flash-attn-v1 to the server environment
    
    * Move disabling prefix caching into the block of exceptions
    
    * Represent capability as `usize`s
flake.lock 26.5 KB