- 22 May, 2025 2 commits
-
-
Jesse Gross authored
FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics. Empty and Zeros directly panic if they can't allocate memory. This makes things consistent by panicking in the first two cases as well, removing a fair amount of error handling code. This is also consistent with how Go typically handles these situations.
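A minimal sketch of the pattern this describes, assuming hypothetical Tensor and FromFloatSlice definitions rather than the actual Ollama API:

```go
// Hypothetical illustration of shifting from error returns to panics for
// input-tensor construction. Names and types are assumptions, not the real API.
package ml

import "fmt"

type Tensor struct {
	data  []float32
	shape []int
}

// FromFloatSlice previously returned (Tensor, error); after this change it
// panics on a shape mismatch or allocation failure, since callers cannot
// meaningfully recover from either.
func FromFloatSlice(data []float32, shape ...int) *Tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Errorf("FromFloatSlice: shape %v does not match %d elements", shape, len(data)))
	}
	return &Tensor{data: data, shape: shape}
}
```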
-
Jesse Gross authored
This provides granular information about the backend memory allocations required by the runner:
- Per backend
- Per layer
- Weights, cache and graph
- Allocation status
This can be used for debugging and validating memory estimates.
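One way such a report might be structured, as a rough sketch; the type and field names below are assumptions for illustration only:

```go
// Hypothetical structures for reporting backend memory allocations.
package memory

type Allocation struct {
	Weights   uint64 // bytes used by layer weights
	Cache     uint64 // bytes used by the KV cache
	Graph     uint64 // bytes used by the compute graph
	Allocated bool   // whether the allocation actually succeeded
}

type BackendReport struct {
	Backend string             // e.g. "cuda0" or "cpu"
	Layers  map[int]Allocation // per-layer breakdown
	Total   Allocation         // sum across layers
}
```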
-
- 21 May, 2025 1 commit
-
-
Michael Yang authored
* feat: qwen3 dense
* feat: qwen3moe
* fix llama4 moe
-
- 20 May, 2025 1 commit
-
-
Michael Yang authored
-
- 19 May, 2025 1 commit
-
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps:
- Create backend, including enumerating tensors and memory allocation
- Loading tensor data
This allows more flexibility in managing model loading.
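A plausible shape for the two-step split, using hypothetical New and LoadTensors names purely for illustration:

```go
// Hypothetical two-phase model load: construct the backend (enumerate tensors,
// allocate memory) first, then stream tensor data in separately.
package backend

import "context"

type Backend struct {
	tensors map[string][]byte // placeholder for allocated tensor buffers
}

// New enumerates tensors and allocates memory, but does not read tensor data.
func New(modelPath string) (*Backend, error) {
	b := &Backend{tensors: make(map[string][]byte)}
	// ... parse metadata from modelPath and allocate buffers ...
	return b, nil
}

// LoadTensors copies tensor data from disk into the buffers allocated by New.
func (b *Backend) LoadTensors(ctx context.Context) error {
	// ... read each tensor's data, honoring ctx for cancellation ...
	return ctx.Err()
}
```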
-
- 15 May, 2025 1 commit
-
-
Michael Yang authored
* panic if trying to pad 4d
* fix pixel values padding
-
- 14 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 12 May, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Michael Yang authored
reduce prompt log to trace level
-
- 06 May, 2025 1 commit
-
-
Daniel Hiltgen authored
* Move quantization logic to GGML via new backend
This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
This is no longer needed now that quantization is implemented in Go+GGML code directly.
-
- 02 May, 2025 1 commit
-
-
Jesse Gross authored
Successfully completing processing with an errgroup cancels the associated context. However, we also have a goroutine that is checking for cancellation of the context. As a result, there is a race where the goroutine can pick up the cancellation and report an error, replacing the successful result. To avoid that, this replaces the goroutine with a cancellation check when we are reading files. This also has the advantage of stopping all reads relatively quickly on error and ensuring that there are no outstanding I/O operations when we return in this case. The downside is that if a file read blocks forever (for example, over the network) then cancellation of the context effectively won't be honored. However, this is also true for other smaller files we read, and the tensors are read in small chunks (128K), so it's consistent and better on balance overall.
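A sketch of the read-loop pattern described here: check the context between chunk reads instead of running a watcher goroutine. The function name and chunking details are assumptions:

```go
// Hypothetical chunked read that checks for context cancellation between
// chunks instead of relying on a separate watcher goroutine.
package load

import (
	"context"
	"io"
)

const chunkSize = 128 * 1024 // tensors are read in ~128K chunks

func readTensor(ctx context.Context, r io.Reader, dst []byte) error {
	for off := 0; off < len(dst); off += chunkSize {
		// Stop promptly if the errgroup (or caller) canceled the context.
		if err := ctx.Err(); err != nil {
			return err
		}
		end := off + chunkSize
		if end > len(dst) {
			end = len(dst)
		}
		if _, err := io.ReadFull(r, dst[off:end]); err != nil {
			return err
		}
	}
	return nil
}
```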
-
- 25 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 18 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 11 Apr, 2025 4 commits
-
-
Jesse Gross authored
For every forward pass through the model, we need to allocate input tensors: tokens, images, positions, outputs and masks. These get allocated in system memory. However, when we close the context that the tensors were allocated through, the metadata gets freed but the actual backend memory does not. This results in a significant memory leak. This makes it so that all the memory allocated through a context gets freed when it is closed. Fixes #10040
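A simplified illustration of the ownership pattern implied here, with hypothetical types; the real backend tracks native allocations rather than Go slices:

```go
// Hypothetical context that records every tensor it allocates and releases
// the backing memory when closed, preventing the leak described above.
package ml

type tensor struct {
	buf []byte // stand-in for a backend allocation
}

func (t *tensor) free() { t.buf = nil }

type Context struct {
	allocated []*tensor
}

func (c *Context) NewTensor(n int) *tensor {
	t := &tensor{buf: make([]byte, n)}
	c.allocated = append(c.allocated, t) // track for later release
	return t
}

// Close frees every allocation made through this context, not just metadata.
func (c *Context) Close() {
	for _, t := range c.allocated {
		t.free()
	}
	c.allocated = nil
}
```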
-
Jesse Gross authored
Allocating (and in particular, freeing) memory from CUDA host buffers is expensive and can cause a significant performance hit if we do it for every token. Using normal system memory avoids this issue and also gives the OS more flexibility to manage it. There is no performance impact from this patch directly (either positive or negative) but it makes a difference once we start freeing memory correctly.
-
Jesse Gross authored
Context is currently mixed between pointer and value receivers. Change this to be all pointer receivers so we don't have to reason about whether the things we are updating in the struct will be retained.
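For context, a minimal Go example of why receiver kind matters: an update made through a value receiver is lost, while a pointer receiver mutates the original struct:

```go
package main

import "fmt"

type counter struct{ n int }

// Value receiver: operates on a copy, so the update is discarded.
func (c counter) incByValue() { c.n++ }

// Pointer receiver: operates on the original, so the update is retained.
func (c *counter) incByPointer() { c.n++ }

func main() {
	c := counter{}
	c.incByValue()
	c.incByPointer()
	fmt.Println(c.n) // prints 1: only the pointer-receiver update stuck
}
```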
-
Jesse Gross authored
Sometimes loading the GGUF file fails with "panic: context canceled". This is probably a filesystem error, but it doesn't provide any information about what happened.
-
- 08 Apr, 2025 2 commits
-
-
Jesse Gross authored
Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations - Ollama will crash in the middle of inference. If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future. Currently, this only generates a worst case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
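A hedged sketch of the reserve-up-front idea: build a maximum-size text batch at startup and run it through the model so allocation failures surface immediately. All names here are illustrative:

```go
// Hypothetical startup reservation: allocate for the largest batch we will
// ever run so an out-of-memory condition surfaces at load time, not mid-inference.
package runner

import "fmt"

type model interface {
	// Forward is assumed to allocate (or reserve) cache and graph memory for the batch.
	Forward(tokens []int32) error
}

func reserveWorstCase(m model, numCtx, batchSize int) error {
	if batchSize > numCtx {
		batchSize = numCtx
	}
	// A dummy batch of maximum size; token values don't matter for sizing.
	tokens := make([]int32, batchSize)
	if err := m.Forward(tokens); err != nil {
		return fmt.Errorf("reserving worst-case memory: %w", err)
	}
	return nil
}
```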
-
Jesse Gross authored
If there is a CUDA OOM, we currently don't check the return value and will eventually segfault. This checks for the problem and generates a Go error. At the moment, this will still result in a panic, but having the error is the first step to being able to handle it more gracefully.
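Illustrative only: the general pattern of checking a native allocation's result and surfacing a Go error instead of later dereferencing a nil pointer. The allocBuffer helper is a stand-in, not the actual ggml binding:

```go
// Hypothetical check on a native buffer allocation that may fail under CUDA OOM.
package backend

import (
	"errors"
	"unsafe"
)

var ErrNoMem = errors.New("insufficient memory on device")

// allocBuffer stands in for a cgo call that returns a nil pointer on failure.
func allocBuffer(size uint64) unsafe.Pointer { return nil }

func newBuffer(size uint64) (unsafe.Pointer, error) {
	p := allocBuffer(size)
	if p == nil {
		// Previously this nil went unchecked and eventually caused a segfault.
		return nil, ErrNoMem
	}
	return p, nil
}
```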
-
- 05 Apr, 2025 1 commit
-
-
Daniel Hipke authored
Improves model loading times on network-based filesystems such as GCS FUSE by creating a dedicated file descriptor for each section of the file being read, reducing seeking.
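A minimal sketch of the per-section descriptor idea using only the standard library, assuming each section's offset and size are known:

```go
// Hypothetical per-section reads: each section gets its own *os.File, so
// concurrent readers never contend on a single descriptor's seek position.
package load

import (
	"io"
	"os"
)

func readSection(path string, offset, size int64) ([]byte, error) {
	f, err := os.Open(path) // dedicated descriptor for this section
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, size)
	// SectionReader issues positioned reads, avoiding extra Seek calls.
	sr := io.NewSectionReader(f, offset, size)
	if _, err := io.ReadFull(sr, buf); err != nil {
		return nil, err
	}
	return buf, nil
}
```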
-
- 03 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open source models. This updates the forward pass of llama architecture models to support both llama models and mistral models by accounting for additional metadata present in mistral models, and finding the correct dimensions for the output projection.
-
Michael Yang authored
-
- 27 Mar, 2025 1 commit
-
-
Jesse Gross authored
Model implementations should use Input for all of their tensors supplied to the model. This includes tensors that relate to the outputs, which is confusing since there is also an Output function. Since Output is only used internally in GGML and not used by any model implementations, we can remove it from the interface to reduce confusion.
-
- 21 Mar, 2025 1 commit
-
-
Michael Yang authored
-
- 18 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
When converting a GGML model, if there is a failure to read tensor data, a nil error value was being returned. It should be assigned the actual error from reading.
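A common shape of this bug in Go, shown with illustrative names rather than the converter's actual code: the read error is captured but never assigned to the value that gets returned:

```go
// Illustrative before/after for returning the real read error.
package convert

import "io"

// Buggy: the error from ReadFull is discarded, so callers see err == nil
// even when the tensor data could not be read.
func readTensorDataBuggy(r io.Reader, buf []byte) error {
	var err error
	if _, readErr := io.ReadFull(r, buf); readErr != nil {
		_ = readErr // mistake: never assigned to err
	}
	return err // always nil
}

// Fixed: the actual read error is returned to the caller.
func readTensorData(r io.Reader, buf []byte) error {
	if _, err := io.ReadFull(r, buf); err != nil {
		return err
	}
	return nil
}
```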
-
- 17 Mar, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 11 Mar, 2025 8 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Patrick Devine authored
-
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
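A sketch of the kind of guard this implies, assuming the server knows the requested cache type and whether flash attention is enabled; the fallback behavior and type names shown are illustrative:

```go
// Illustrative guard: only allow a quantized KV cache when flash attention
// is enabled; otherwise fall back to the default f16 cache.
package server

import "log/slog"

func kvCacheType(requested string, flashAttention bool) string {
	quantized := requested == "q8_0" || requested == "q4_0"
	if quantized && !flashAttention {
		slog.Warn("quantized KV cache requires flash attention; falling back to f16")
		return "f16"
	}
	if requested == "" {
		return "f16"
	}
	return requested
}
```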
-
Jesse Gross authored
Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
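The usual way to meet such a requirement is to round each requested size up to a multiple of the backend's alignment; a minimal sketch:

```go
// Round a buffer size up to the backend's required alignment so allocations
// of that size cannot fail the backend's size checks.
package backend

func roundUp(size, align uint64) uint64 {
	if align == 0 {
		return size
	}
	return ((size + align - 1) / align) * align
}
```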
-
- 07 Mar, 2025 2 commits
-
-
Jesse Gross authored
-
Michael Yang authored
This ensures the tensor is created on the right buffer type for backends such as the CPU.
-