ml: Abstract attention out of model definitions · f53f4198
Jesse Gross
    
    
There are two benefits to doing this:
 - It provides a library function that models can call, reducing the code
   needed in each model implementation.
 - It gives a single place to drop in optimized attention implementations
   based on the backend or other factors; one is provided for GGML (see
   the sketch after this list).
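
As a minimal sketch of the dispatch pattern (the Tensor, FusedAttention,
and Attention names below are illustrative placeholders, not the real
ml package API):

    package attention

    import "math"

    // Tensor is a stand-in for the backend tensor type (illustrative).
    type Tensor interface {
        MatMul(other Tensor) Tensor
        Scale(s float64) Tensor
        Softmax() Tensor
    }

    // FusedAttention is optionally implemented by backends (e.g. GGML)
    // that provide a single optimized attention kernel.
    type FusedAttention interface {
        Attention(query, key, value Tensor, scale float64) Tensor
    }

    // Attention computes scaled dot-product attention once, for every
    // model. key is assumed already transposed for the Q·K^T product.
    func Attention(backend any, query, key, value Tensor, headDim int) Tensor {
        scale := 1.0 / math.Sqrt(float64(headDim))

        // Fast path: use the backend's fused kernel when it has one.
        if f, ok := backend.(FusedAttention); ok {
            return f.Attention(query, key, value, scale)
        }

        // Generic path: softmax(Q K^T / sqrt(d)) V from primitive ops.
        scores := query.MatMul(key).Scale(scale).Softmax()
        return scores.MatMul(value)
    }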
    
On CUDA this improves the token generation rate by about 3%; it has no
significant effect on Metal.
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>