1. 27 Sep, 2024 1 commit
    • Daniël de Kok's avatar
      Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2
      Daniël de Kok authored
      * Improve support for GPUs with capability < 8
      
      - For models that cannot use flashinfer, use flash-attn v1 + paged
        attention for models with a compute capability older than 8.
      - Disable prefix caching when using paged attention.
      - When using flash-attn v1, pass the key/value, rather than the
        cache, since v1 cannot use block tables.
      
      * nix: add flash-attn-v1 to the server environment
      
      * Move disabling prefix caching into the block of exceptions
      
      * Capability as `usize`s
      5b6b74e2