"vscode:/vscode.git/clone" did not exist on "42d70a9ff62594570efb0b694557e63deecbd675"
- 27 Feb, 2025 1 commit
Michael Yang authored
- 21 Feb, 2025 1 commit
Jesse Gross authored
There are two benefits to doing this:
- Provides a library function that models can use, reducing the code needed for each model implementation.
- Enables a single place to drop in optimized implementations of attention based on the backend or other factors. One is provided for GGML.

On CUDA this improves the token generation rate by about 3%. It does not have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
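As a rough sketch of the idea, a single shared helper can own the standard scaled dot-product attention computation; the Tensor interface and method names below are illustrative assumptions, not Ollama's actual API:

```go
package nn

// Tensor is a hypothetical stand-in for the engine's tensor type;
// the method names are assumptions for illustration only.
type Tensor interface {
	Mulmat(other Tensor) Tensor // matrix product: t x other
	Transpose() Tensor          // swap the last two dimensions
	Scale(s float64) Tensor     // multiply every element by s
	Softmax() Tensor            // softmax over the last dimension
}

// Attention computes softmax(Q K^T * scale) V. Because every model calls
// this one function, a backend such as GGML can substitute an optimized
// or fused kernel here without touching individual model definitions.
func Attention(query, key, value Tensor, scale float64) Tensor {
	scores := query.Mulmat(key.Transpose()).Scale(scale) // Q K^T, scale ~ 1/sqrt(head dim)
	weights := scores.Softmax()                          // normalize into attention weights
	return weights.Mulmat(value)                         // weighted sum of value vectors
}
```

Centralizing the call site in this way is what allows a backend-specific implementation, such as the GGML one mentioned above, to be dropped in at a single place.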
- 20 Feb, 2025 1 commit
Jesse Gross authored
Currently, Rows is called as the last step in a model computation to get the values for the output tokens. However, if we move it earlier in the process, we can trim out computations that never get used. This is similar to how models are defined in llama.cpp. Changing the model definition in this way improves token generation performance by approximately 8%.
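The sketch below illustrates the reordering with hypothetical names (the Tensor interface and forward functions are assumptions, not the real model code): during generation only the output tokens' logits are needed, so gathering rows before the large vocabulary projection avoids projecting positions that would be discarded anyway.

```go
package model

// Tensor stands in for the engine's tensor type; the methods are
// illustrative assumptions.
type Tensor interface {
	Mulmat(other Tensor) Tensor  // matrix product
	Rows(indices []int32) Tensor // gather the given rows
}

// Before: project every position into vocabulary space, then keep only
// the rows for the output tokens. Most of the projection is wasted work.
func forwardBefore(hidden, outputWeight Tensor, outputs []int32) Tensor {
	logits := hidden.Mulmat(outputWeight) // (all positions) x vocab
	return logits.Rows(outputs)
}

// After: gather the output rows first, so the expensive projection runs
// only for the tokens whose logits are actually used.
func forwardAfter(hidden, outputWeight Tensor, outputs []int32) Tensor {
	selected := hidden.Rows(outputs) // trim unused positions early
	return selected.Mulmat(outputWeight)
}
```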
- 14 Feb, 2025 2 commits
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure, such as the runner and the Ollama server. In addition, it builds out the KV cache infrastructure to support the requirements of how Ollama runs models, including:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both the old and new engines continue to be supported; by default, only the old engine is used. To enable the new engine:

1. Start the server with the OLLAMA_NEW_ENGINE environment variable set:
   OLLAMA_NEW_ENGINE=1 ./ollama serve
2. Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
   ./ollama run jessegross/llama3.1
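For illustration, a KV cache interface covering those requirements might look like the hedged sketch below; the method names and signatures are assumptions, not Ollama's actual kvcache API:

```go
package kvcache

// Tensor is a placeholder for the engine's tensor type.
type Tensor interface{}

// Cache is a hypothetical sketch of a KV cache supporting the
// requirements listed above; names and signatures are illustrative.
type Cache interface {
	// StartForward prepares the cache for a batch in which multiple
	// sequences are processed in parallel.
	StartForward(seqs []int, positions []int32) error

	// Remove frees positions [begin, end) for a sequence, enabling
	// context shifting and later defragmentation of the freed slots.
	Remove(seq int, begin, end int32) error

	// Put stores a layer's key/value tensors; Get retrieves them when
	// computing attention.
	Put(layer int, key, value Tensor)
	Get(layer int) (key, value Tensor)
}
```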
Jesse Gross authored
This moves the list of supported models into its own file, so it is no longer mixed into the runner code.
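One idiomatic Go way to keep such a list in a dedicated file, offered here as an assumption about the shape of the change rather than the actual code, is a package whose sole job is to pull in each model implementation via side-effect imports:

```go
// Hypothetical models.go: this file only names the supported models,
// each of which registers itself in its init function. The import paths
// are illustrative assumptions.
package models

import (
	_ "github.com/ollama/ollama/model/models/llama"
	_ "github.com/ollama/ollama/model/models/mllama"
)
```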