Commits · 74bd09652d69c77a4bed34b3afda74c87295115b · OpenDAS / ollama

21 Mar, 2025 2 commits

ml/backend/ggml: load tensors in 32KiB chunks · 74bd0965
Michael Yang authored Mar 19, 2025

74bd0965

ollamarunner: Provide mechanism for backends to report loading progress · 0ff28758

Jesse Gross authored Mar 20, 2025

This enables the runner to report progress back to the Ollama server,
both for showing status to the user and also to prevent the server
from killing the runner if it thinks things have stalled.

Most of the infrastructure was already there; this extends it to
be available to the backends.

0ff28758
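
The two commits above (chunked tensor loading and backend progress reporting) fit together naturally. The following is a minimal sketch, not the code from either commit: `copyWithProgress`, `demoLoad`, and the callback signature are hypothetical names chosen for illustration. Only the 32KiB chunk size comes from the commit title.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// chunkSize mirrors the 32KiB figure in the commit title.
const chunkSize = 32 * 1024

// copyWithProgress (hypothetical helper) copies src to dst one chunk at a
// time and reports the fraction completed after every chunk, so a caller
// can tell that loading has not stalled.
func copyWithProgress(dst io.Writer, src io.Reader, total int64, progress func(float32)) (int64, error) {
	buf := make([]byte, chunkSize)
	var done int64
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return done, werr
			}
			done += int64(n)
			if progress != nil && total > 0 {
				progress(float32(done) / float32(total))
			}
		}
		if err == io.EOF {
			return done, nil
		}
		if err != nil {
			return done, err
		}
	}
}

// demoLoad (hypothetical) loads size zero bytes from memory and returns how
// many progress callbacks fired along with the number of bytes copied.
func demoLoad(size int) (calls int, copied int64, err error) {
	src := bytes.NewReader(make([]byte, size))
	copied, err = copyWithProgress(io.Discard, src, int64(size), func(float32) { calls++ })
	return calls, copied, err
}

func main() {
	calls, copied, err := demoLoad(100 * 1024)
	fmt.Println(calls, copied, err)
}
```

Small chunks trade a little per-call overhead for frequent progress updates, which is what lets a supervising process distinguish a slow load from a hung one.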

18 Mar, 2025 1 commit

ggml: return error on failure to read tensor data (#9872) · df94175a

Bruce MacDonald authored Mar 18, 2025

When converting a ggml model, a failure to read tensor data caused a nil error value to be returned. The returned value should instead be the actual error from the read.

df94175a
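
The class of bug described here is easy to reproduce in a few lines. This is an illustrative sketch only: `readTensor` and `failingReader` are hypothetical names and do not reflect the actual conversion code.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// failingReader is a hypothetical reader whose Read always fails.
type failingReader struct{}

func (failingReader) Read([]byte) (int, error) {
	return 0, errors.New("disk read failed")
}

// readTensor (hypothetical) reads exactly size bytes of tensor data.
// The important part is the error path: the read error is wrapped and
// returned rather than being replaced by nil.
func readTensor(r io.Reader, size int) ([]byte, error) {
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		// Returning (nil, nil) here would make the caller believe the
		// read succeeded and continue with missing data.
		return nil, fmt.Errorf("read tensor data: %w", err)
	}
	return buf, nil
}

func main() {
	_, err := readTensor(failingReader{}, 16)
	fmt.Println(err)
}
```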

17 Mar, 2025 2 commits
- ml/backend/ggml: allocate memory with malloc when loading model (#9822) · 364629b8
  Jeffrey Morgan authored Mar 17, 2025
  
  364629b8
- conditionally enable parallel pipelines · 4561fff3
  Michael Yang authored Mar 14, 2025
  
  4561fff3
13 Mar, 2025 1 commit
- ollama-debug.c: change 'ld' to 'PRIi64' · 30d7a59b
  shane.xb.qian authored Mar 13, 2025
```
* macOS has different definition per info from @mxyng
```
  30d7a59b
12 Mar, 2025 1 commit
- ollama-debug.c: correct mistype · 85ab5520
  shane.xb.qian authored Mar 12, 2025
```
Signed-off-by: shane.xb.qian <shane.qian@foxmail.com>
```
  85ab5520
11 Mar, 2025 8 commits
- use 2d pooling · 63a39406
  Michael Yang authored Mar 11, 2025
  
  63a39406
- fallback to cpu · c5cbe4fc
  Michael Yang authored Mar 10, 2025
  
  c5cbe4fc
- ollama debug tensor · 9e4642e9
  Michael Yang authored Mar 09, 2025
  
  9e4642e9
- duplicate token_embd to output · 6b0486c2
  Michael Yang authored Mar 09, 2025
  
  6b0486c2
- use fast attention · 8934324b
  Michael Yang authored Mar 07, 2025
  
  8934324b
- set non-causal attention · 0df18004
  Michael Yang authored Mar 07, 2025
  
  0df18004
- add gemma vision encoder · 4b037a97
  Michael Yang authored Mar 06, 2025
  
  4b037a97
- gemma2 impl · 5f74d1fd
  Patrick Devine authored Feb 07, 2025
  
  5f74d1fd
10 Mar, 2025 1 commit

fix: pad tensor item if ge zero · 9926eae0

Michael Yang authored Mar 07, 2025

this produces nicer output since both positive and negative values
produce the same width

9926eae0
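
The padding idea can be illustrated in Go with the space flag of `fmt`, which reserves a column for the sign of non-negative numbers. This is a sketch of the technique only; `formatItem` is a hypothetical name, and the actual change may be implemented differently.

```go
package main

import "fmt"

// formatItem (hypothetical) formats a tensor element so that values
// greater than or equal to zero get a leading space where a negative
// value would have its minus sign.
func formatItem(v float64) string {
	return fmt.Sprintf("% .4f", v)
}

func main() {
	for _, v := range []float64{1.5, -1.5, 0.25, -0.25} {
		fmt.Printf("[%s]\n", formatItem(v))
	}
}
```

With equal widths, columns of mixed-sign values line up when a tensor is dumped row by row.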

08 Mar, 2025 2 commits

ml: Add support for quantized KV cache · 4100ed7b

Jesse Gross authored Feb 21, 2025

Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.

4100ed7b

ggml-backend: Ensure allocations meet backend requirements · 25f9b152

Jesse Gross authored Mar 07, 2025

Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.

25f9b152
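
One common way to satisfy a size alignment requirement is to round the requested size up to the next multiple of the alignment. The sketch below shows that arithmetic in isolation; `alignUp` is a hypothetical helper, and the commit itself may instead query the backend for the required size.

```go
package main

import "fmt"

// alignUp (hypothetical) rounds size up to the next multiple of alignment.
// An alignment of zero is treated as "no requirement".
func alignUp(size, alignment uint64) uint64 {
	if alignment == 0 {
		return size
	}
	return (size + alignment - 1) / alignment * alignment
}

func main() {
	fmt.Println(alignUp(100, 32)) // rounds up to the next multiple of 32
	fmt.Println(alignUp(128, 32)) // already aligned, unchanged
}
```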

07 Mar, 2025 13 commits
- additional review comments · 98272fbd
  Jesse Gross authored Mar 07, 2025
  
  98272fbd
- ml/backend/ggml: use backend buffer type · b27e8f3f
  Michael Yang authored Mar 05, 2025
```
this ensures the tensor is created on the right buffer type for backends
such as cpu
```
  b27e8f3f
- comments · 45df786f
  Michael Yang authored Mar 04, 2025
  
  45df786f
- ml/backend/ggml: clean up · daaf42e4
  Michael Yang authored Feb 28, 2025
  
  daaf42e4
- ml/backend/ggml: offload vision to cpu · 2dc60d46
  Michael Yang authored Feb 27, 2025
```
temporary until tensor loading can accurately account for vision models
```
  2dc60d46
- ml/backend/ggml: handle tensor split · b5312f30
  Michael Yang authored Feb 26, 2025
  
  b5312f30
- ml/backend/ggml: handle user specified cpu offloading · 26c2e0bd
  Michael Yang authored Feb 26, 2025
  
  26c2e0bd
- ml/backend/ggml: set cpu n_threads · bf920883
  Michael Yang authored Feb 26, 2025
  
  bf920883
- ml/backend/ggml: create tensor on specific backend · 7bae7fa5
  Michael Yang authored Feb 25, 2025
```
some tensors should be created on specific backends to reduce number of
copies and improve performance
```
  7bae7fa5
- kvcache: create cache ctx per layer · 764e199d
  Michael Yang authored Feb 25, 2025
```
each cache layer creates and maintains its own context instead of using
a large context for all layers
```
  764e199d
- model: load non-repeated tensors into multiple backends · bfce55db
  Michael Yang authored Feb 24, 2025
```
some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends
```
  bfce55db
- ml/backend/ggml: update model loading for hybrid/multi backends · bab6f34d
  Michael Yang authored Feb 19, 2025
```
use a similar strategy as llama.cpp for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor
```
  bab6f34d
- llama: fix kv loading on snowflake-arctic-embed models (#9536) · 4289c743
  Jeffrey Morgan authored Mar 07, 2025
  
  4289c743
04 Mar, 2025 1 commit

ml/backend/ggml: consolidate system info logging · 05a01fde

Michael Yang authored Feb 28, 2025

- output backend system info when initializing the backend. this ensures
  this information is always present without needing to be called
  explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name

05a01fde
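
The last two bullets of this commit (enumerating devices and grouping indices by name) could look roughly like the sketch below, which uses the standard `log/slog` package for structured output. The `device` type and `groupIndices` function are hypothetical, and the interpretation of "indices grouped by device name" is an assumption.

```go
package main

import "log/slog"

// device is a hypothetical stand-in for a backend device entry.
type device struct {
	Name string
}

// groupIndices (hypothetical) maps each device name to the positions at
// which devices with that name appear in the ordered device list.
func groupIndices(devices []device) map[string][]int {
	groups := make(map[string][]int)
	for i, d := range devices {
		groups[d.Name] = append(groups[d.Name], i)
	}
	return groups
}

func main() {
	devices := []device{{Name: "CPU"}, {Name: "GPU"}, {Name: "GPU"}}
	groups := groupIndices(devices)

	// Log in device order so the output is stable across runs.
	seen := make(map[string]bool)
	for _, d := range devices {
		if seen[d.Name] {
			continue
		}
		seen[d.Name] = true
		slog.Info("system", "device", d.Name, "indices", groups[d.Name])
	}
}
```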

03 Mar, 2025 1 commit

fix: own lib/ollama directory · ba7d3124

Michael Yang authored Mar 03, 2025

expand backend loading error handling to catch more problems and log
them instead of panicking

ba7d3124

02 Mar, 2025 4 commits

ml: Enable support for flash attention · 21aa666a

Jesse Gross authored Feb 25, 2025

The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.

Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.

21aa666a

ml: Empty tensor constructor for tensors · ee141cc8

Jesse Gross authored Feb 28, 2025

In cases where we allocate a tensor and then fully overwrite it with
copied data, it is wasteful to first zero out the memory.

ee141cc8

ggml-backend: Store parent backend as part of tensor · 55e5776c

Jesse Gross authored Feb 27, 2025

It can be important for a tensor to know what backend it came from -
for example, to know if flash attention is enabled.

55e5776c

attention: Remove unnecessary contiguous operations · 854a9195

Jesse Gross authored Feb 22, 2025

Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the contiguity
rules for mulmat and the Contiguous call can simply be removed.

Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.

To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure;
for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.

This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.

854a9195
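
To see why a permutation can leave a tensor non-contiguous, and why a Contiguous call then implies a copy, consider the toy model below. It is a sketch under stated assumptions: the `view` type and its methods are hypothetical, it uses a generic row-major stride convention rather than GGML's dimension ordering, and it does not reproduce the specific rules that mulmat applies.

```go
package main

import "fmt"

// view is a hypothetical tensor view: a shape plus element strides.
type view struct {
	shape   []int
	strides []int
}

// newContiguous builds a densely packed row-major view.
func newContiguous(shape ...int) view {
	strides := make([]int, len(shape))
	step := 1
	for i := len(shape) - 1; i >= 0; i-- {
		strides[i] = step
		step *= shape[i]
	}
	return view{shape: append([]int(nil), shape...), strides: strides}
}

// permute reorders dimensions without moving data: only the shape and
// stride metadata change, which is why it is cheap.
func (v view) permute(order ...int) view {
	out := view{shape: make([]int, len(order)), strides: make([]int, len(order))}
	for i, o := range order {
		out.shape[i] = v.shape[o]
		out.strides[i] = v.strides[o]
	}
	return out
}

// isContiguous reports whether the strides match a dense row-major layout.
func (v view) isContiguous() bool {
	step := 1
	for i := len(v.shape) - 1; i >= 0; i-- {
		if v.shape[i] != 1 && v.strides[i] != step {
			return false
		}
		step *= v.shape[i]
	}
	return true
}

func main() {
	q := newContiguous(2, 3, 4)
	p := q.permute(1, 0, 2)
	fmt.Println(q.isContiguous(), p.isContiguous())
}
```

Making the permuted view dense again requires writing every element to a new location, so folding that write into a copy that already has to happen (here, the copy into the cache) avoids paying for it twice.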

27 Feb, 2025 3 commits
- ml: update Context.Forward interface · 3e8b8a19
  Michael Yang authored Feb 21, 2025
```
update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute
```
  3e8b8a19
- model: add bos token if configured · 53d2990d
  Michael Yang authored Feb 26, 2025
  
  53d2990d
- ml/backend/ggml: fix debug logging · a59f6652
  Michael Yang authored Feb 26, 2025
  
  a59f6652
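
The Context.Forward change described in commit 3e8b8a19 above (variadic tensors, and returning the Context so it can be chained with Context.Compute) can be sketched as follows. The method shapes follow the commit message; the `Tensor` struct and `recordingContext` implementation are hypothetical and exist only to make the example runnable.

```go
package main

import "fmt"

// Tensor is a hypothetical placeholder for a real tensor type.
type Tensor struct {
	Name string
}

// Context sketches the interface shape described in the commit message:
// Forward accepts multiple tensors, like Compute, and returns the Context.
type Context interface {
	Forward(tensors ...Tensor) Context
	Compute(tensors ...Tensor)
}

// recordingContext (hypothetical) just records which tensors it was given.
type recordingContext struct {
	forwarded []string
	computed  []string
}

func (c *recordingContext) Forward(tensors ...Tensor) Context {
	for _, t := range tensors {
		c.forwarded = append(c.forwarded, t.Name)
	}
	return c // returning the receiver is what enables chaining
}

func (c *recordingContext) Compute(tensors ...Tensor) {
	for _, t := range tensors {
		c.computed = append(c.computed, t.Name)
	}
}

func main() {
	ctx := &recordingContext{}
	logits := Tensor{Name: "logits"}
	hidden := Tensor{Name: "hidden"}

	// Build and run in one expression.
	ctx.Forward(logits, hidden).Compute(logits)
	fmt.Println(ctx.forwarded, ctx.computed)
}
```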