21 Feb, 2025 · 1 commit
      ml: Abstract attention out of model definitions · f53f4198
      Jesse Gross authored
      
      
      There are two benefits to doing this:
       - Provides a library function that models can use, reducing the code
         needed for each model implementation (a sketch of such a function
         follows below)
       - Enables a single place to drop in optimized implementations of
         attention based on the backend or other factors. One is provided
         for GGML.
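
      As a minimal sketch, assuming a plain single-head formulation, the kind
      of helper such a library function could expose might look like the
      following. All names here are hypothetical and this is not ollama's
      actual API; real backends would operate on tensors and fuse these steps.

        // Illustrative only: standalone scaled dot-product attention,
        // softmax(Q K^T * scale) V, for a single head.
        package main

        import (
            "fmt"
            "math"
        )

        // Attention takes q [n x d], k [m x d], v [m x d] and returns [n x d].
        func Attention(q, k, v [][]float64, scale float64) [][]float64 {
            out := make([][]float64, len(q))
            for i := range q {
                // Scaled dot products of this query row against every key row.
                scores := make([]float64, len(k))
                maxScore := math.Inf(-1)
                for j := range k {
                    for t := range q[i] {
                        scores[j] += q[i][t] * k[j][t]
                    }
                    scores[j] *= scale
                    if scores[j] > maxScore {
                        maxScore = scores[j]
                    }
                }
                // Numerically stable softmax over the scores.
                var sum float64
                for j := range scores {
                    scores[j] = math.Exp(scores[j] - maxScore)
                    sum += scores[j]
                }
                // Output row is the softmax-weighted sum of the value rows.
                out[i] = make([]float64, len(v[0]))
                for j := range v {
                    for t := range v[j] {
                        out[i][t] += scores[j] / sum * v[j][t]
                    }
                }
            }
            return out
        }

        func main() {
            q := [][]float64{{1, 0}}
            k := [][]float64{{1, 0}, {0, 1}}
            v := [][]float64{{1, 2}, {3, 4}}
            fmt.Println(Attention(q, k, v, 1/math.Sqrt(2))) // scale = 1/sqrt(d)
        }

      Centralizing attention in a single function like this is what lets a
      backend such as GGML drop in an optimized kernel without touching any
      model code.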
      
      On CUDA this improves token generation rate by about 3%. It does not
      have a significant effect on Metal.
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>