- 23 Oct, 2025 1 commit
-
-
Jesse Gross authored
We currently short-circuit generation of the cache mask and just generate an empty tensor of the correct size. However, in some cases this can also skip a cast operation, which can result in the worst case graph not being fully worst case. We don't actually need the fast path for mask generation, so it's better to just use the normal code path.
-
- 08 Oct, 2025 1 commit
-
-
Jesse Gross authored
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that are out of the cache's window each time we start a new forward pass. The cache storage needs to handle the window size for each sequence plus the batch size, since the batch needs to attend to the full window. This means that more than a window's worth of tokens is stored while processing the batch. When the next batch comes, we currently only look at the sequences in the incoming batch to slide the window forward. However, we also need to clean up the other sequences that might be occupying space in the batch processing buffer, to ensure each sequence is only using its window size of storage. Failure to do this can result in "no kv cache slot found" errors. Fixes: #10127
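A rough sketch of the cleanup described above, assuming hypothetical names (`cacheEntry`, `slideAllSequences`, `latestPos`) rather than the actual kvcache types: at the start of a forward pass, every sequence with data in the cache is trimmed to its window, not just the sequences present in the incoming batch.

```go
// Sketch only: hypothetical types, not the real ollama kvcache structures.
package kvcache

type cacheEntry struct {
	seq int
	pos int32
}

// slideAllSequences drops entries that have fallen out of each sequence's
// window. It considers every sequence with data in the cache, not just the
// sequences in the incoming batch, so that no sequence keeps more than its
// window of storage while a new batch is being processed.
func slideAllSequences(entries []cacheEntry, latestPos map[int]int32, windowSize int32) []cacheEntry {
	kept := entries[:0]
	for _, e := range entries {
		if latest, ok := latestPos[e.seq]; ok && e.pos <= latest-windowSize {
			continue // slid out of the window; free the slot
		}
		kept = append(kept, e)
	}
	return kept
}
```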
-
- 19 Aug, 2025 1 commit
-
-
Jesse Gross authored
Flash attention kernels require that the KV cache mask be an F16 rather than an F32. We can use the GGML operation ggml_cast to do this rather than doing it ourselves, which allows reuse of a preallocated buffer in the graph rather than allocating a new one for each batch. This improves token generation performance with flash attention by 10-30% (with gpt-oss). This also makes performance with flash attention better than without it, as expected.
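A minimal sketch of the idea, with a placeholder `Context` interface and `dtypeF16` constant standing in for the real ml types; the assumption is that `Cast` lowers to GGML's ggml_cast, so the converted mask lives in the preallocated graph buffer instead of a fresh per-batch allocation.

```go
// Sketch only: placeholder interfaces, not the actual ml package API.
package attention

type Tensor interface{}

type Context interface {
	// Cast is assumed to lower to GGML's ggml_cast.
	Cast(t Tensor, dtype int) Tensor
}

const dtypeF16 = 1 // placeholder for the F16 tensor type

// buildMask returns the mask in the dtype the attention kernel needs: the
// flash attention kernels require F16, so the F32 mask is cast once inside
// the compute graph rather than converted on the CPU for every batch.
func buildMask(ctx Context, maskF32 Tensor, flashAttention bool) Tensor {
	if !flashAttention {
		return maskF32
	}
	return ctx.Cast(maskF32, dtypeF16)
}
```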
-
- 04 Aug, 2025 1 commit
-
-
Jesse Gross authored
There is a bug when using sliding window attention where we run out of KV cache slots. This is likely due to not correctly removing all of the entries as they slide out of range. This adds additional logging when this occurs to track down the source. Bug #10127
-
- 31 Jul, 2025 1 commit
-
-
Jesse Gross authored
Models that use sliding window attention can only resume a sequence from the cache if it falls within the saved window. This works well if the next message picks up where the old one left off. However, it generally prevents a partial prefix match unless the entire conversation falls within the sliding window. This can be a problem with reasoning models, where the traces are supposed to be removed from future messages, forcing the entire history to be re-evaluated. This change allows models to specify that a larger amount of history be retained in memory, allowing more partial resumption. It still respects the window that the model was trained on for token generation.
-
- 29 Jul, 2025 1 commit
-
-
Jesse Gross authored
When we context shift, we delete half the context and apply RoPE with an offset to the other half. We used to RoPE across the entire context in a single pass with a zero offset for the deleted section. With the change to shifting in batches, we can skip any batches where all of the offsets would be zero. This typically reduces the number of operations by half.
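A sketch of the batched shift with the zero-offset skip, assuming a hypothetical `applyRoPE` callback in place of the real per-chunk tensor operations.

```go
// Sketch only: the real code operates on cache tensors, not plain offsets.
package kvcache

// shiftInChunks applies the context-shift RoPE in batch-size chunks and skips
// any chunk whose offsets are all zero (for example, the half of the context
// that was deleted), roughly halving the number of operations.
func shiftInChunks(offsets []int32, batchSize int, applyRoPE func(start, end int)) {
	for start := 0; start < len(offsets); start += batchSize {
		end := min(start+batchSize, len(offsets))

		allZero := true
		for _, off := range offsets[start:end] {
			if off != 0 {
				allZero = false
				break
			}
		}
		if allZero {
			continue // nothing to rotate in this chunk
		}
		applyRoPE(start, end)
	}
}
```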
-
- 25 Jul, 2025 1 commit
-
-
Jesse Gross authored
Currently, when we need to do a shift on the cache, it is one RoPE operation on the entire size of the cache (per layer). In some cases, this can create a compute graph that is larger than the forward pass, since the forward pass is working in batches. Since we don't consider shifting in our memory estimates, it's possible for this to cause a crash if we run out of memory. By limiting the size of the RoPE calls to batch size chunks, we ensure that the shift will never exceed the size of the forward pass, since the forward pass will also contain a RoPE of the same size. This does not have a significant impact on performance since RoPE is a math operation that is mostly proportional to the size of its inputs. In theory, defrag could have the same issue since it also creates a compute graph outside of the forward pass; however, since it only performs copies, it does not require any working space.
-
- 27 May, 2025 1 commit
-
-
Jesse Gross authored
Computing an attention mask for a large context and max batch is expensive - over 100ms. Models like Gemma3 that have multiple types of caches and custom attention masks need to do this 4 times, so this adds approximately 500ms to startup time when using a 128k context. When we are reserving the worst case graph, we don't need the mask, only its shape, so we can skip this.
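A sketch of the reservation fast path described above, using placeholder `Context`/`Tensor` interfaces and a hypothetical `reserve` flag; `Empty` is assumed to allocate an uninitialized tensor of the requested shape. (A later commit removes this fast path again because it can also skip a cast; see the 23 Oct, 2025 entry.)

```go
// Sketch only: placeholder interfaces, not the actual ml package API.
package kvcache

type Tensor interface{}

type Context interface {
	Empty(dtype int, shape ...int) Tensor
	FromFloatSlice(data []float32, shape ...int) Tensor
}

const dtypeF32 = 0 // placeholder

// buildCausalMask returns the attention mask. While reserving the worst case
// graph, only the shape matters, so the expensive value computation is skipped.
func buildCausalMask(ctx Context, reserve bool, batchSize, cacheLen int) Tensor {
	if reserve {
		return ctx.Empty(dtypeF32, cacheLen, batchSize)
	}

	values := make([]float32, batchSize*cacheLen)
	// ... fill values with 0 / -Inf according to causality and sequence membership ...
	return ctx.FromFloatSlice(values, cacheLen, batchSize)
}
```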
-
- 22 May, 2025 1 commit
-
-
Jesse Gross authored
FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics. Empty and Zeros directly panic if they can't allocate memory. This makes things consistent by panicking in the first two cases as well, removing a fair amount of error handling code. This is also consistent with how Go typically handles these situations.
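A minimal illustration of the convention, not the actual ml package code; the tensor allocation itself is elided and the name is lowercased to mark it as a stand-in.

```go
// Sketch only: illustrates the panic-on-bad-input convention.
package ml

import "fmt"

// fromFloatSlice panics if the data does not match the requested shape,
// mirroring how Empty and Zeros already panic when allocation fails. Inputs
// live in system memory, and callers generally cannot recover anyway.
func fromFloatSlice(data []float32, shape ...int) []float32 {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Errorf("invalid shape %v for %d elements", shape, len(data)))
	}
	// ... allocate the backing tensor in system memory and copy data into it ...
	return data
}
```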
-
- 01 May, 2025 1 commit
-
-
Jesse Gross authored
In some cases, we can't find a cache slot when using sliding window attention. It would be helpful in this (and other cases) to know what the batch size is. Bug #10127
-
- 25 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 08 Apr, 2025 1 commit
-
-
Jesse Gross authored
Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations - Ollama will crash in the middle of inference. If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future. Currently, this only generates a worst case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
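A sketch of the startup reservation under hypothetical `model` and `kvCache` interfaces; the real runner builds a full worst case text batch and reserves the graph and cache for it rather than lazily growing them.

```go
// Sketch only: hypothetical interfaces, not the actual runner code.
package runner

type model interface {
	// Forward builds and allocates the compute graph for a batch.
	Forward(tokens []int32) error
}

type kvCache interface {
	// Allocate reserves storage for every layer at full capacity.
	Allocate() error
}

// reserveWorstCase allocates the KV cache and the largest text batch graph at
// startup, so an out-of-memory condition surfaces immediately instead of
// mid-inference after another application has claimed VRAM.
func reserveWorstCase(m model, c kvCache, maxBatch int) error {
	if err := c.Allocate(); err != nil {
		return err
	}
	worst := make([]int32, maxBatch) // worst case text-only batch
	return m.Forward(worst)
}
```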
-
- 02 Apr, 2025 1 commit
-
-
jmorganca authored
The sliding window cache trims entries that are outside the window for the latest token. This works when we are extending the cache, such as when the conversation continues. However, if we have a partial overlap in conversation (including the BOS tokens), then we resume from a past point in the conversation, but the needed tokens are no longer stored in memory. This verifies that the new window overlaps with the old one before reusing the cache. Co-authored-by: Jesse Gross <jesse@ollama.com>
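A simplified sketch of the overlap check, working on plain lengths rather than stored cache entries; the names and the exact condition are assumptions.

```go
// Sketch only: the real check walks the stored cache entries.
package kvcache

// canReuse reports whether a cached sequence of cachedLen tokens can be
// resumed after matching a prefix of prefixLen tokens, when only the last
// windowSize positions are retained. The first new token sits at position
// prefixLen and must be able to attend to [prefixLen-windowSize, prefixLen),
// so every position in that range must still be present in the cache.
func canReuse(prefixLen, cachedLen, windowSize int32) bool {
	oldestCached := max(0, cachedLen-windowSize)
	oldestNeeded := max(0, prefixLen-windowSize)
	return oldestNeeded >= oldestCached
}
```

Under this condition a partial prefix is only reusable when the whole cached conversation still fits in the window (or the prefix matches everything cached), which is the limitation the 31 Jul, 2025 commit above relaxes by retaining more history.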
-
- 26 Mar, 2025 1 commit
-
-
Jesse Gross authored
When computing the size of the cache for sliding window attention, we don't need to multiply the batch size by the number of parallel sequences - the batch size is constant. This also simplifies the check for whether to allocate the cache size based on capacity or window size, as the batch size is already incorporated into the capacity when handled by the runner.
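A sketch of the resulting sizing arithmetic, with illustrative names; `capacity` is assumed to already include the scaling the runner applies for parallel sequences and batch size.

```go
// Sketch only: illustrative arithmetic, not the exact sizing code.
package kvcache

// swaCacheSize sizes a sliding window cache. The window must be kept for each
// parallel sequence, but the batch is shared, so the batch term is added once
// rather than once per sequence. The result is capped at the configured
// capacity, which the runner has already scaled for parallel sequences.
func swaCacheSize(windowSize, maxBatch, numSequences, capacity int32) int32 {
	size := windowSize*numSequences + maxBatch
	if size > capacity {
		size = capacity
	}
	return size
}
```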
-
- 21 Mar, 2025 3 commits
-
-
Jesse Gross authored
Currently sliding window attention allocates and uses the full context size and just masks out any tokens that are outside of the window. However, we really only need (roughly) the sliding window size. At large context sizes this improves two things:
- Memory allocated - since the full context size was allocated up front, memory requirements drop substantially. On Gemma3:4b with a 32k context window, total memory usage (including weights and non-sliding layers) drops from ~20GB to ~8GB.
- Computation - ranges that are completely outside of the sliding window are now removed from the tensors that are returned from the cache rather than simply being masked out. This results in more efficient processing, scaling with the size of the context that has actually been used.
Notably, this does not update the scheduler for any model to be aware of the smaller memory requirements. This is difficult for Gemma3 because the layers are heterogeneous between sliding and non-sliding attention. As a result, while actual memory consumption will be reduced, the scheduler will over-estimate the requirements of the model. This means that splitting between GPUs or GPUs and CPUs will still be suboptimal. Bug #9730
-
Jesse Gross authored
Currently the runner computes the kv size needed and creates a cache of that size. This is the context size times number of parallel sequences. Cache implementations can make better decisions about their memory usage, so instead pass in the required capacity, number of sequences and maximum batch size. For now, the causal cache just uses this to compute the size in the same way as before.
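A sketch of the shape of this interface change, with placeholder `Backend` and `DType` types standing in for the real ml package types; the method name and parameter order are assumptions.

```go
// Sketch only: placeholder types, not the real kvcache.Cache interface.
package kvcache

type Backend interface{}
type DType int

// Instead of being handed a precomputed KV size, cache implementations receive
// the raw parameters and decide how much memory they actually need. A causal
// cache can reproduce the old sizing; a sliding window cache can allocate less.
type Cache interface {
	Init(backend Backend, dtype DType, maxSequences, capacity, maxBatch int)
}
```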
-
Jesse Gross authored
Defragging the KV cache can generate a lot of operations, so we need to be careful that we don't overflow the number that the graph can support. We currently account for all of the nodes that we add to the graph for each move but we also need to include the original cache tensors as well. Fixes #9904
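A sketch of the node accounting, with a hypothetical `nodesPerMove` parameter; the real code derives that count from the operations each move emits.

```go
// Sketch only: illustrative accounting, not the actual defrag code.
package kvcache

// maxMovesPerGraph bounds how many cache moves can go into one defrag graph.
// Each move adds a fixed number of nodes, but the original cache tensors also
// count against the graph's node limit, so they are reserved up front.
func maxMovesPerGraph(maxGraphNodes, nodesPerMove, cacheTensors int) int {
	budget := maxGraphNodes - cacheTensors
	if budget <= 0 {
		return 0
	}
	return budget / nodesPerMove
}
```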
-
- 20 Mar, 2025 1 commit
-
-
Jesse Gross authored
Options is no longer very descriptive of this struct.
-
- 11 Mar, 2025 2 commits
-
-
Jesse Gross authored
Currently we are using positions, which are relative to a sequence and may not be unique.
-
Michael Yang authored
-
- 10 Mar, 2025 1 commit
-
-
Jesse Gross authored
The encoder cache needs to know the position of images in the input stream so that it knows when to delete them. Previously images didn't have a position, so we implied one by breaking batches before an image and then assuming the image was in the first position. However, multimodal objects are now given explicit positions in the input stream, so we can use that instead. Breaking batches was also a way to simulate a cross attention mask for mllama. However, given that it only supports a single sequence and a single image, this mask doesn't serve any real purpose. Removing the batch break does not appear to affect the quality of the output. Most of this is simply moving the input data structures to a new package to avoid import cycles.
-
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
-
Jesse Gross authored
Models can disable causality for all or part of their processing while continuing to store data in the KV cache.
-
- 07 Mar, 2025 2 commits
-
-
Michael Yang authored
Some tensors should be created on specific backends to reduce the number of copies and improve performance.
-
Michael Yang authored
Each cache layer creates and maintains its own context instead of using a single large context for all layers.
-
- 02 Mar, 2025 2 commits
-
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
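A sketch of the padding side of this, where `faPad` is an assumed alignment constant rather than the value the GGML kernels actually require.

```go
// Sketch only: the real padding and permutation requirements come from GGML.
package kvcache

const faPad = 256 // assumed alignment; placeholder for the kernel's requirement

// paddedCacheLen rounds the in-use cache length up so the key/value views
// handed to the flash attention kernel satisfy its padding requirement.
func paddedCacheLen(used int) int {
	return (used + faPad - 1) / faPad * faPad
}
```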
-
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%. The permutations of query and key do not violate the continuity rules for mulmat and the Contiguous call can be simply removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead. To support this and avoid unexpected tensor shapes that are seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
-
- 27 Feb, 2025 1 commit
-
-
Michael Yang authored
- update Context.Forward to accept multiple tensors to match the Context.Compute signature
- update Context.Forward to return Context such that it can be chained with Context.Compute
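A sketch of the resulting signatures, trimmed to the two methods in question; the real ml.Context has many more methods, so treat this as an assumption about shape rather than the actual interface.

```go
// Sketch only: trimmed-down stand-in for the ml.Context interface.
package ml

type Tensor interface{}

type Context interface {
	// Forward accepts multiple tensors, matching Compute, and returns the
	// Context so the two calls can be chained.
	Forward(tensors ...Tensor) Context
	Compute(tensors ...Tensor)
}
```

With this shape, a caller can write the chained form, e.g. `ctx.Forward(hiddenState, logits).Compute(hiddenState, logits)`.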
-
- 14 Feb, 2025 1 commit
-
-
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models, such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models
Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:
- Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve
- Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1
-