Commits · d4e0da08907f7611e1a2d9bda319bb30cd4ff029 · OpenDAS / ollama

06 Nov, 2025 2 commits
- Remove unnecessary MacOs 13 and lower Patches (#12656) · d4e0da08
  Thomas Stocker authored Nov 07, 2025
```
* Remove unnecessary macos 13 Patch

* Remove unnecessary MacOs Version Guard patch

* rename patches

* remove again macos13 patch

* rename files
```
- ggml update to b6840 (#12791) · 544b6739
  Daniel Hiltgen authored Nov 06, 2025
30 Oct, 2025 1 commit

- ggml: Enable op_offload to improve partial offload performance · afaf7ce8
  Jesse Gross authored Oct 27, 2025

When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.
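To illustrate why small batches favor the CPU and large ones favor the GPU, the Go sketch below uses a hypothetical linear cost model. Copying a layer's weights to the GPU costs a fixed amount per batch, while per-token compute is cheaper on the GPU. The type `costModel`, its fields, and all the numbers are assumptions invented for this example. They are not taken from ollama or ggml.

```go
package main

import (
	"fmt"
	"math"
)

// costModel is a hypothetical, deliberately simplified model of running one
// layer whose weights live in system RAM. All fields are illustrative
// assumptions, not values measured in ollama or ggml.
type costModel struct {
	cpuPerToken    float64 // ms to process one token on the CPU
	gpuPerToken    float64 // ms to process one token on the GPU
	weightTransfer float64 // ms to copy the layer's weights to the GPU, paid once per batch
}

// cpuTime is the cost of computing the whole batch on the CPU.
func (m costModel) cpuTime(batch int) float64 {
	return float64(batch) * m.cpuPerToken
}

// gpuTime is the cost of copying the weights to the GPU and computing there.
func (m costModel) gpuTime(batch int) float64 {
	return m.weightTransfer + float64(batch)*m.gpuPerToken
}

// breakEven returns the smallest batch size at which the GPU path is strictly
// cheaper than the CPU path, or -1 if the GPU never wins.
func (m costModel) breakEven() int {
	saving := m.cpuPerToken - m.gpuPerToken
	if saving <= 0 {
		return -1
	}
	// gpuTime < cpuTime  <=>  batch > weightTransfer / saving
	return int(math.Floor(m.weightTransfer/saving)) + 1
}

func main() {
	// Made-up numbers chosen only to make the crossover easy to see.
	m := costModel{cpuPerToken: 1.0, gpuPerToken: 0.25, weightTransfer: 24.0}
	for _, b := range []int{1, 8, 32, 64, 512} {
		fmt.Printf("batch=%4d cpu=%7.2fms gpu=%7.2fms\n", b, m.cpuTime(b), m.gpuTime(b))
	}
	fmt.Println("GPU is cheaper from batch size", m.breakEven()) // prints 33
}
```

With these invented numbers, the two paths tie at a batch of 32 and the GPU wins from 33 tokens upward. The real crossover depends on the hardware, the model, and the operation.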

The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).

There is an existing heuristic to dynamically switch between these
two modes but in practice it doesn't have enough information to
accurately make that decision. This change supplies authoritative data
so the check works as intended, getting the best of both worlds.

Fixes #12037
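The dynamic switching heuristic mentioned in this commit can be pictured as a per-op decision. The Go sketch below is hypothetical. `opInfo`, `offloadPolicy`, `shouldOffload`, and the threshold of 32 are illustrative names and values, not ggml's or ollama's actual API, and the sketch does not reproduce the specific data this commit adds. It shows only the general shape of the decision: offload an op when its weights sit in system RAM and the batch is large enough to amortize the copy.

```go
package main

import "fmt"

// opInfo is a hypothetical description of one operation the scheduler is
// about to run. The field names are illustrative assumptions, not ggml's
// or ollama's actual types.
type opInfo struct {
	name          string
	batchSize     int  // number of tokens processed together by this op
	weightsOnHost bool // true if the op's weights were left in system RAM
}

// offloadPolicy decides, per op, whether to copy host-resident weights to
// the GPU for computation. minBatch is an assumed threshold for illustration.
type offloadPolicy struct {
	minBatch int
}

// shouldOffload reports whether the op should be computed on the GPU.
func (p offloadPolicy) shouldOffload(op opInfo) bool {
	if !op.weightsOnHost {
		// Weights already live on the GPU: nothing needs to be transferred,
		// so the op runs there regardless of batch size.
		return true
	}
	// Host-resident weights: pay the transfer cost only when the batch is
	// large enough to amortize it (e.g. prompt processing); otherwise stay
	// on the CPU (e.g. single-token generation).
	return op.batchSize >= p.minBatch
}

func main() {
	p := offloadPolicy{minBatch: 32} // hypothetical threshold
	ops := []opInfo{
		{name: "prompt, 512 tokens, host weights", batchSize: 512, weightsOnHost: true},
		{name: "generation, 1 token, host weights", batchSize: 1, weightsOnHost: true},
		{name: "generation, 1 token, GPU weights", batchSize: 1, weightsOnHost: false},
	}
	for _, op := range ops {
		dev := "CPU"
		if p.shouldOffload(op) {
			dev = "GPU"
		}
		fmt.Printf("%-36s -> %s\n", op.name, dev)
	}
}
```

In a sketch like this, the decision is only as good as its inputs. The commit message attributes the earlier wrong choices to the existing check not having enough information, not to the idea of switching itself.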
