Commits · 1a2feb2a970c8331e1d34f68877190c169999c44 · OpenDAS / ollama

10 Oct, 2025 1 commit

Michael Yang authored Oct 10, 2025

hardErrCh will deadlock since forwardBatch is blocked on
computeStartedCh which never gets sent. since the response to
hardErrCh is to panic, just panic instead

1a2feb2a

09 Oct, 2025 3 commits
- ollamarunner: measure only active time · 967a82f5
  Michael Yang authored Sep 29, 2025
  
  967a82f5
- Revert "add truncate and shift parameters (#12519)" (#12545) · 7d965258
  Jeffrey Morgan authored Oct 08, 2025
```
This reverts commit 6a62b894.
```
  7d965258
- add truncate and shift parameters (#12519) · 6a62b894
  Jeffrey Morgan authored Oct 08, 2025
  
  6a62b894
01 Oct, 2025 1 commit

Use runners for GPU discovery (#12090) · bc8909fb

Daniel Hiltgen authored Oct 01, 2025

This revamps how we discover GPUs in the system by leveraging the Ollama
runner. This should eliminate inconsistency between our GPU discovery and the
runner's capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs. Now the runner does that implicitly based on the actual
device list. In some cases free VRAM reporting can be unreliable which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed as only one GPU leveraged this, which
is now documented. This GPU will soon fall off the support matrix with the next
ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.

bc8909fb

17 Sep, 2025 1 commit
- refactor: use the built-in max/min to simplify the code (#12280) · 05d53457
  russcoss authored Sep 16, 2025
```
Signed-off-by: russcoss <russcoss@outlook.com>
```
  05d53457
16 Sep, 2025 1 commit
- embed: cleanup (#12299) · c253433d
  Michael Yang authored Sep 16, 2025
```
* cleanup

* use pooling.TypeNone

* pooling test
```
  c253433d
15 Sep, 2025 1 commit
- batch: use tensors for outputs (#12185) · 6f711714
  Michael Yang authored Sep 15, 2025
```
this cleans up the model interface slightly without too much impact in
other areas
```
  6f711714
12 Sep, 2025 2 commits
- Revert "runner: move harmony to runner (#12052)" · 92b96d54
  jmorganca authored Sep 12, 2025
```
This reverts commit 1a558f98.
```
  92b96d54
- Revert "runner: simplify parser entrypoints in runner (#12233)" · 9d56e63d
  jmorganca authored Sep 12, 2025
```
This reverts commit 8d6fffae.
```
  9d56e63d
11 Sep, 2025 1 commit

ollamarunner: Suppress stack trace during memory allocation · 26214125

Jesse Gross authored Sep 11, 2025

Allocation failures can be a normal part of new memory estimates, so
we shouldn't print a stack trace in this case.

26214125

10 Sep, 2025 1 commit
- runner: simplify parser entrypoints in runner (#12233) · 8d6fffae
  Parth Sareen authored Sep 10, 2025
  
  8d6fffae
09 Sep, 2025 1 commit

llm: Clamp batch size to context size · e119783e

Jesse Gross authored Sep 08, 2025

The context must always be able to store the current batch, so
if the user requests a small context then we should also shrink
the batch to match. This also fixes the TestLongInputContext
test on the new engine. (The old engine already has this behavior.)
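
The clamping rule described above can be sketched in a few lines of Go. This is only an illustration, not the code from the commit: `clampBatch` and its parameter names are hypothetical, and it relies on the built-in `min` available since Go 1.21.

```go
package main

import "fmt"

// clampBatch is a hypothetical helper: it limits the requested batch size
// so that a full batch always fits inside the context.
func clampBatch(batchSize, contextSize int) int {
	return min(batchSize, contextSize)
}

func main() {
	fmt.Println(clampBatch(512, 128))  // small context: batch shrinks to 128
	fmt.Println(clampBatch(512, 4096)) // large context: batch stays at 512
}
```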

e119783e

08 Sep, 2025 2 commits
- runner: move harmony to runner (#12052) · 1a558f98
  Parth Sareen authored Sep 08, 2025
  
  1a558f98
- fix: nil pointer dereference if cache is nil (#12215) · 9714e38d
  Michael Yang authored Sep 08, 2025
  
  9714e38d
04 Sep, 2025 2 commits
- embedding gemma model (#12181) · 5994e8e8
  Michael Yang authored Sep 04, 2025
```
* ollama: add embeddings
```
  5994e8e8
- more logutil.Trace (#12177) · b3e61207
  Michael Yang authored Sep 03, 2025
  
  b3e61207
29 Aug, 2025 1 commit

perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd

Daniel Hiltgen authored Aug 29, 2025

* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main GPU
intensive tasks (Compute+Floats) in a go routine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.
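
The overlap described in the first item can be sketched as follows. This is a simplified model under stated assumptions, not the runner's actual loop: `batch`, `prepare`, `compute` and `runPipelined` are hypothetical names, with `prepare` standing in for graph building and `compute` for the GPU-intensive step.

```go
package main

import "fmt"

// batch is a hypothetical unit of work.
type batch struct{ id int }

// prepare stands in for building the graph for a batch (CPU work).
func prepare(id int) batch { return batch{id: id} }

// compute stands in for the GPU-intensive step.
func compute(b batch) int { return b.id * 2 }

// runPipelined computes batch i in a goroutine while batch i+1 is prepared
// on the calling goroutine, then waits for the result before moving on.
func runPipelined(n int) []int {
	results := make([]int, 0, n)
	if n == 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		cur := next
		done := make(chan int, 1)
		go func() { done <- compute(cur) }()
		if i+1 < n {
			next = prepare(i + 1) // overlaps with the compute above
		}
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3))
}
```

The goroutine only reads `cur`, which is local to the iteration, so preparing `next` concurrently does not race with it.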

517807cd

22 Aug, 2025 1 commit
- chore: remove redundant words in comment (#12028) · 109d4fc3
  zoupingshi authored Aug 23, 2025
```
Signed-off-by: zoupingshi <hangfachang@outlook.com>
```
  109d4fc3
14 Aug, 2025 1 commit

llm: New memory management · d5a0d8d9

Jesse Gross authored May 29, 2025

This changes the memory allocation strategy from upfront estimation to
tracking actual allocations done by the engine and reacting to that. The
goal is to avoid issues caused by both under-estimation (crashing) and
over-estimation (low performance due to under-utilized GPUs).

It is currently opt-in and can be enabled for models running on the
Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
cases is unchanged and will continue to use the existing estimates.

d5a0d8d9

08 Aug, 2025 1 commit

ggml: Support closing backends · 756c78cf

Jesse Gross authored Apr 17, 2025

In order to iteratively find the best memory allocation, we need to
be able to free backend memory so we can try again.

756c78cf

22 May, 2025 2 commits

ml: Panic rather than return error on tensor allocation failure · 1f371ea9

Jesse Gross authored May 19, 2025

FromFloatSlice and FromIntSlice return an error if the shape doesn't
match the passed data or if memory can't be allocated. Since these
are inputs, the memory being allocated is system memory rather than VRAM.

In many cases, the caller can't really handle the error and panics.

Empty and Zeros directly panic if they can't allocate memory.

This makes things consistent by panicking for the first two cases,
removing a fair amount of error handling code. This is also consistent
with how Go typically handles these situations.
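
A minimal sketch of the panic-on-failure style, assuming a toy `tensor` type. The lower-case `fromFloatSlice` here is a hypothetical stand-in and does not reproduce the real `FromFloatSlice` signature.

```go
package main

import "fmt"

// tensor is a hypothetical, simplified tensor type.
type tensor struct {
	shape []int
	data  []float32
}

// fromFloatSlice panics instead of returning an error when the shape does
// not match the data, so callers need no error-handling boilerplate.
func fromFloatSlice(data []float32, shape ...int) tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Sprintf("invalid shape %v for %d elements", shape, len(data)))
	}
	return tensor{shape: shape, data: data}
}

func main() {
	t := fromFloatSlice([]float32{1, 2, 3, 4}, 2, 2)
	fmt.Println(t.shape, len(t.data))
}
```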

1f371ea9

ollamarunner: Memory usage reporting · 73d6a82c

Jesse Gross authored Apr 17, 2025

This provides granular information about the backend memory allocations
required by the runner:
 - Per backend
 - Per layer
 - Weights, cache and graph
 - Allocation status

This can be used for debugging and validating memory estimates.

73d6a82c

19 May, 2025 1 commit

ggml: Separate tensor load from backend creation · 94ab428e

Jesse Gross authored Apr 17, 2025

Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.

94ab428e

15 May, 2025 3 commits

ollamarunner: Multi-modal worst case graph · fe623c2c

Jesse Gross authored Apr 07, 2025

We currently preallocate compute graph memory for the worst case
batch of text tokens. This adds support for doing the same for
images.

Note that image models are more complicated than text models in
how they process their inputs so there may be cases where this
approach isn't completely generic for all models. It covers all
currently supported models though.

fe623c2c

ollamarunner: Separate text and multimodal graphs · 3c14461d

Jesse Gross authored May 05, 2025

For some multimodal models (such as gemma3), we create a single
graph that generates the image embedding and then use this in the
text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple
batches. This can arise if the embedding is larger than the batch size.
In these cases (as with llama4), we would like to create views that
are more appropriately sized. However, if we do this then the original
source tensor is used in multiple graphs, which isn't allowed. To
avoid that problem, models with this pattern compute the embedding
tensor on first use and recreate the individual views. There is no
longer a single vision and text graph.

This codifies the pattern of separating vision and text graphs. The
logic of computing tensors on demand is moved to the runner, so models
no longer have to worry about this. It also gives the runner visibility
into the multimodal tensors, which is important for memory management.

3c14461d

ollamarunner: Base cached tokens on current prompt · 499ae731

Jesse Gross authored May 09, 2025

When we restore a sequence from the cache, we split the prompt into
the already used tokens (stored in the cache) and new tokens that
need to be processed. Currently, the references to the used tokens
are coming from the stored previous sequence.

However, even though we know that the used tokens are semantically
equivalent to the prefix of the prompt, tokens can contain pointers
which are no longer valid. As a result, it is better to get the
used tokens from the prompt, which has currently valid pointers.

This doesn't currently have any impact because it isn't possible
to reuse the pointers (which are tensors) anyway. However, it
becomes an issue once we can.
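
The idea can be sketched as below, assuming a simplified `input` type whose pointer field stands in for a tensor. `input` and `splitPrompt` are hypothetical names; the point is only that both returned slices come from the prompt rather than from the stored sequence.

```go
package main

import "fmt"

// input is a hypothetical stand-in for a runner input: a token plus an
// optional pointer to multimodal data.
type input struct {
	token      int32
	multimodal *string
}

// splitPrompt finds the prefix shared with the cached sequence and returns
// the used and pending inputs. Both slices are taken from prompt, so any
// pointers they carry are the currently valid ones.
func splitPrompt(cached, prompt []input) (used, pending []input) {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n].token == prompt[n].token {
		n++
	}
	return prompt[:n], prompt[n:]
}

func main() {
	stale, fresh := "stale", "fresh"
	cached := []input{{token: 1, multimodal: &stale}, {token: 2}}
	prompt := []input{{token: 1, multimodal: &fresh}, {token: 2}, {token: 3}}
	used, pending := splitPrompt(cached, prompt)
	fmt.Println(len(used), len(pending), *used[0].multimodal)
}
```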

499ae731

12 May, 2025 1 commit
- feat: add trace log level (#10650) · f95a1f2b
  Michael Yang authored May 12, 2025
```
reduce prompt log to trace level
```
  f95a1f2b
08 May, 2025 1 commit

ollamarunner: Use correct constant to remove cache entries · 3d9498a4

Jesse Gross authored May 07, 2025

The correct constant to remove all entries to the end of the sequence
for the Ollama engine is math.MaxInt32. -1 is used by the old engine.

The impact of this is currently minimal because it would only occur
in situations that are not supported by the implemented models or
rarely used options.
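
A sketch of the sentinel convention, using a plain slice in place of a cache. `removeRange` is a hypothetical helper and not the cache's real interface; only the use of `math.MaxInt32` as the "to the end" marker comes from the message above.

```go
package main

import (
	"fmt"
	"math"
)

// removeRange drops entries in [begin, end). Passing math.MaxInt32 as end
// means "through the end of the sequence".
func removeRange(entries []int32, begin, end int32) []int32 {
	if end == math.MaxInt32 {
		end = int32(len(entries))
	}
	kept := append([]int32{}, entries[:begin]...)
	return append(kept, entries[end:]...)
}

func main() {
	cache := []int32{10, 11, 12, 13, 14}
	fmt.Println(removeRange(cache, 2, math.MaxInt32)) // [10 11]
	fmt.Println(removeRange(cache, 1, 3))             // [10 13 14]
}
```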

3d9498a4

05 May, 2025 1 commit

api: remove unused or unsupported api options (#10574) · 3b2d2c83

Jeffrey Morgan authored May 05, 2025

Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options

3b2d2c83

02 May, 2025 1 commit

ollamarunner: Re-enable worst case graph preallocation. · c2f5d666

Jesse Gross authored May 02, 2025

Worst case graph preallocation was disabled by a27462b7
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.

c2f5d666

01 May, 2025 1 commit

ollamarunner: Fix memory leak when processing images · 8e8f2c6d

Jesse Gross authored May 01, 2025

The context (and therefore associated input tensors) was not being
properly closed when images were being processed. We were trying to
close them but in reality we were closing over an empty list, preventing
anything from actually being freed.
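
One way this kind of bug can arise in Go is sketched below. It is not the code from the fix: `resource`, `leaky` and `fixed` are hypothetical. Arguments to a deferred call are evaluated when the `defer` statement runs, so a slice that is still empty at that point stays empty for the deferred call.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for a context holding tensors.
type resource struct{ closed bool }

func (r *resource) Close() { r.closed = true }

func closeAll(rs []*resource) {
	for _, r := range rs {
		r.Close()
	}
}

// leaky defers closeAll with the slice as it is now (empty), so the
// resources appended afterwards are never closed.
func leaky() []*resource {
	var rs []*resource
	defer closeAll(rs)
	rs = append(rs, &resource{}, &resource{})
	return rs
}

// fixed defers a closure, which reads rs when it runs at function exit.
func fixed() []*resource {
	var rs []*resource
	defer func() { closeAll(rs) }()
	rs = append(rs, &resource{}, &resource{})
	return rs
}

func main() {
	fmt.Println(leaky()[0].closed, fixed()[0].closed)
}
```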

Fixes #10434

8e8f2c6d

29 Apr, 2025 1 commit

ollamarunner: Temporarily disable worst case graph preallocation · a27462b7

Jesse Gross authored Apr 29, 2025

When we later have a large batch running purely on a CPU, this
results in the error:
GGML_ASSERT(talloc->buffer_id >= 0)

Disabling this means that we will incrementally reallocate memory
as the graph grows.

Fixes #10410

a27462b7

24 Apr, 2025 1 commit
- llama: remove model loading for grammar (#10096) · a53d744b
  Parth Sareen authored Apr 24, 2025
  
  a53d744b
08 Apr, 2025 1 commit

ollamarunner: Preallocate worst case graph at startup · dbb149e6

Jesse Gross authored Apr 03, 2025

Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.

This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.

Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.

dbb149e6

03 Apr, 2025 1 commit

llm: set done reason at server level (#9830) · e53b3cbd

Bruce MacDonald authored Apr 03, 2025

No functional change. Many different done reasons can be set at the runner
level, so rather than obscuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.

e53b3cbd

02 Apr, 2025 2 commits

kvcache: Add check for values that fall out of sliding window cache · b4297006

jmorganca authored Mar 30, 2025

The sliding window cache trims entries that are outside the window for
the latest token. This works when we are extending the cache, such as
when the conversation continues. However, if we have a partial overlap
in conversation (including the BOS tokens), then we resume from a past
point in the conversation and the needed tokens are no longer stored
in memory. This verifies that the new window overlaps with the old one
before reusing the cache.

Co-authored-by: Jesse Gross <jesse@ollama.com>

b4297006

ollamarunner: Don't truncate a SameBatch · 493385eb

Jesse Gross authored Apr 01, 2025

When truncating inputs to the context window at the beginning of
a sequence, we remove the minimum amount possible. However, this
may cause us to truncate to the middle of a set of inputs that
the model specified should not be split up. To avoid this, we
need to remove the rest of the partial batch.
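
The adjustment can be sketched as follows. This is an illustration under an assumption: each input carries a `sameBatch` count meaning "this many following inputs must stay with me". The `input` type and `adjustDiscard` are hypothetical and may differ from the runner's real data structures.

```go
package main

import "fmt"

// input is a hypothetical runner input. sameBatch is assumed to be the
// number of following inputs that must not be separated from this one.
type input struct{ sameBatch int }

// adjustDiscard takes the minimum number of leading inputs to drop and
// extends it so that the cut never lands inside a group.
func adjustDiscard(inputs []input, discard int) int {
	for i := 0; i < discard && i < len(inputs); i++ {
		groupEnd := i + inputs[i].sameBatch // last index of the group starting at i
		if discard <= groupEnd {
			discard = groupEnd + 1 // drop the rest of the partial group
		}
	}
	return min(discard, len(inputs))
}

func main() {
	// Index 1 starts a group covering indices 1 through 3.
	inputs := []input{{}, {sameBatch: 2}, {}, {}, {}}
	fmt.Println(adjustDiscard(inputs, 2)) // cut inside the group: becomes 4
	fmt.Println(adjustDiscard(inputs, 1)) // cut before the group: stays 1
}
```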

493385eb

31 Mar, 2025 2 commits

runner: clear cache when shift is not possible (#9433) · 66b25392

Bruce MacDonald authored Mar 31, 2025

Clear KV cache when shift operation is not supported by model.
Added KvCacheCanShift() check to handle models that can't perform cache shifts,
falling back to full cache clear while preserving logical token history to
maintain expected behavior when context window fills up.

66b25392

runner: Release semaphore and improve error messages on failures · b2a46529

Jesse Gross authored Mar 14, 2025

If we have an error after creating a new sequence but before
finding a slot for it, we return without releasing the semaphore.
This reduces our parallel sequences and eventually leads to deadlock.

In practice this should never happen because once we have acquired
the semaphore, we should always be able to find a slot. However, the
code is clearly not correct.
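
The leak and its fix can be sketched with a buffered channel acting as a counting semaphore. This is a simplified model, not the runner's code: `runner`, `newRunner` and `admit` are hypothetical, and the `failEarly` flag merely simulates an error between acquiring the semaphore and finding a slot.

```go
package main

import (
	"errors"
	"fmt"
)

// runner is a hypothetical model: sem limits parallel sequences and slots
// records which sequence slots are occupied.
type runner struct {
	sem   chan struct{}
	slots []bool
}

func newRunner(parallel int) *runner {
	return &runner{sem: make(chan struct{}, parallel), slots: make([]bool, parallel)}
}

// admit acquires the semaphore and claims a slot. Every error path after
// the acquire releases the semaphore before returning.
func (r *runner) admit(failEarly bool) error {
	r.sem <- struct{}{} // acquire
	if failEarly {
		<-r.sem // release, otherwise capacity shrinks permanently
		return errors.New("failed to prepare sequence")
	}
	for i, used := range r.slots {
		if !used {
			r.slots[i] = true
			return nil
		}
	}
	<-r.sem // release on the "no slot" path as well
	return errors.New("could not find an available sequence slot")
}

func main() {
	r := newRunner(1)
	fmt.Println(r.admit(true))  // error, semaphore released
	fmt.Println(r.admit(false)) // succeeds because capacity was not leaked
}
```

Without the release on the early error path, the second call in `main` would block forever on a semaphore of size one.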

b2a46529