- 02 Mar, 2025 1 commit
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to those requirements so that flash attention can be enabled. Flash attention can be used in the same situations as with the llama engine and is enabled by the user in the same way.
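
For context, the padding requirement amounts to rounding the cache length the kernel sees up to a fixed multiple. A minimal Go sketch of that arithmetic follows; the package, helper names, and the 256 multiple are illustrative assumptions, not values taken from this commit.

```go
package example

// flashAttnPadding is an assumed padding multiple used only to illustrate the
// rounding; the real requirement comes from the GGML flash attention kernel.
const flashAttnPadding = 256

// roundUp rounds n up to the nearest multiple of m, e.g. roundUp(1000, 256) == 1024.
func roundUp(n, m int) int {
	return ((n + m - 1) / m) * m
}

// paddedCacheLen sketches how a KV cache might size the view it hands to the
// flash attention kernel: cached tokens plus the current batch, rounded up.
func paddedCacheLen(numCached, batchSize int) int {
	return roundUp(numCached+batchSize, flashAttnPadding)
}
```
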
- 28 Feb, 2025 2 commits
Michael Yang authored
defer the cancel to guarantee it runs
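
The pattern referred to, shown with a hypothetical handler; the names and timeout are illustrative and not the code touched by this commit.

```go
package example

import (
	"context"
	"time"
)

func handleRequest(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	// Deferring the cancel guarantees it runs on every return path, so the
	// context's resources are released even if an error returns early.
	defer cancel()

	return doWork(ctx)
}

// doWork stands in for the real work; it only waits on the context.
func doWork(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(time.Second):
		return nil
	}
}
```
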
Bruce MacDonald authored
While adding support for weighted sampling we have seen some performance regressions. Bypass the sampler logic for now and default to greedy sampling until we can benchmark the new sampler logic.
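
Greedy here means picking the single highest-probability token, i.e. the argmax over the logits. A minimal sketch, not the project's actual sampler code:

```go
package example

// greedy returns the index of the largest logit, i.e. the most likely token.
// It is the simplest possible sampler and avoids the cost of building a
// weighted distribution; illustrative only.
func greedy(logits []float32) int {
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	return best
}
```
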
- 27 Feb, 2025 1 commit
Michael Yang authored
- 25 Feb, 2025 1 commit
Parth Sareen authored
- 20 Feb, 2025 1 commit
Jesse Gross authored
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
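
One plausible way to pass these through is an options struct given to the backend when it is created. The struct below mirrors the parameter list above, but its name and field types are assumptions for illustration, not the commit's actual API.

```go
package example

// BackendParams is a hypothetical options struct showing how the runner's
// flags could be forwarded to the backend. Per the commit message, the GGML
// backend would still ignore these values for now.
type BackendParams struct {
	NumGPULayers int       // number of layers to offload to the GPU
	MainGPU      int       // index of the primary GPU
	Threads      int       // CPU threads to use for evaluation
	TensorSplit  []float32 // fraction of the model to place on each GPU
}
```
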
- 14 Feb, 2025 2 commits
Daniel Hiltgen authored
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure, such as the runner and Ollama server. It also builds out the KV cache infrastructure to support requirements of how Ollama runs models, such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
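
For reference, an environment-variable gate like this usually reduces to a simple check at startup. The helper below is an illustrative sketch, not Ollama's actual configuration code:

```go
package example

import "os"

// newEngineEnabled reports whether OLLAMA_NEW_ENGINE is set to something
// truthy. Illustrative only; the real server's env handling is more involved.
func newEngineEnabled() bool {
	switch os.Getenv("OLLAMA_NEW_ENGINE") {
	case "", "0", "false":
		return false
	default:
		return true
	}
}
```
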