21 Feb, 2025 · 1 commit
      ml: Abstract attention out of model definitions · f53f4198
      Jesse Gross authored
      
      
      There are two benefits to doing this:
       - Provides a library function that models can use, reducing the code
         needed for each model implementation (a sketch of such a function
         follows below)
       - Enables a single place to drop in optimized implementations of
         attention based on the backend or other factors. One is provided
         for GGML.
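
      As a minimal sketch, assuming a plain single-head formulation, the kind
      of helper such a library function could expose might look like the
      following. All names here are hypothetical and this is not ollama's
      actual API; real backends would operate on tensors and fuse these steps.

        // Illustrative only: standalone scaled dot-product attention,
        // softmax(Q K^T * scale) V, for a single head.
        package main

        import (
            "fmt"
            "math"
        )

        // Attention takes q [n x d], k [m x d], v [m x d] and returns [n x d].
        func Attention(q, k, v [][]float64, scale float64) [][]float64 {
            out := make([][]float64, len(q))
            for i := range q {
                // Scaled dot products of this query row against every key row.
                scores := make([]float64, len(k))
                maxScore := math.Inf(-1)
                for j := range k {
                    for t := range q[i] {
                        scores[j] += q[i][t] * k[j][t]
                    }
                    scores[j] *= scale
                    if scores[j] > maxScore {
                        maxScore = scores[j]
                    }
                }
                // Numerically stable softmax over the scores.
                var sum float64
                for j := range scores {
                    scores[j] = math.Exp(scores[j] - maxScore)
                    sum += scores[j]
                }
                // Output row is the softmax-weighted sum of the value rows.
                out[i] = make([]float64, len(v[0]))
                for j := range v {
                    for t := range v[j] {
                        out[i][t] += scores[j] / sum * v[j][t]
                    }
                }
            }
            return out
        }

        func main() {
            q := [][]float64{{1, 0}}
            k := [][]float64{{1, 0}, {0, 1}}
            v := [][]float64{{1, 2}, {3, 4}}
            fmt.Println(Attention(q, k, v, 1/math.Sqrt(2))) // scale = 1/sqrt(d)
        }

      Centralizing attention in a single function like this is what lets a
      backend such as GGML drop in an optimized kernel without touching any
      model code.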
      
      On CUDA this improves token generation rate by about 3%. It does not
      have a significant effect on Metal.
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>