1. 23 Oct, 2025 1 commit
    • kvcache: Remove special case for reservation mask · 1c093e97
      Jesse Gross authored
      We currently short-circuit generation of the cache mask and just
      generate an empty tensor of the correct size. However, in some
      cases, this can also skip a cast operation. This can result in the
      worst case graph not actually being worst case.
      
      We don't actually need the fast path for mask generation, so it's
      better to just use the normal code path.
      1c093e97
  2. 08 Oct, 2025 1 commit
    • kvcache: Clean up sliding window state with independent batches · 1fc35f12
      Jesse Gross authored
      Sliding window models (e.g. gpt-oss, gemma3) remove tokens that
      are out of the cache's window each time we start a new forward pass.
      
      The cache storage needs to handle the window size for each sequence
      plus the batch size, since the batch needs to attend to the full
      window. This means that more than a window's worth of tokens is
      stored while processing the batch.
      
      When the next batch comes, we are currently only looking at the
      sequences in the incoming batch to slide the window forward.
      However, we also need to clean up the other sequences that might
      be occupying space in the batch processing buffer to ensure each
      sequence is only using its window size of storage. Failure to do
      this can result in "no kv cache slot found" errors.
      
      Fixes: #10127
      1fc35f12
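      A minimal sketch of the fix, using made-up names (cache, highestPos and
      remove are illustrative stand-ins, not the actual kvcache types): at the
      start of each forward pass, every tracked sequence is trimmed back to its
      window, not only the sequences present in the incoming batch.

      package swacleanup

      // Simplified stand-in for the real sliding window cache state.
      type cache struct {
          windowSize int32
          highestPos map[int]int32 // highest position stored per sequence
      }

      // remove evicts cache cells for seq in positions [begin, end).
      func (c *cache) remove(seq int, begin, end int32) { /* ... */ }

      // startForward runs before slots are assigned for a new batch. Trimming
      // every sequence, not only those in the incoming batch, keeps each one
      // within its window and avoids "no kv cache slot found" errors.
      func (c *cache) startForward() {
          for seq, pos := range c.highestPos {
              if end := pos + 1 - c.windowSize; end > 0 {
                  c.remove(seq, 0, end)
              }
          }
      }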
  3. 19 Aug, 2025 1 commit
    • kvcache: Use Cast instead of Copy for flash attention masks · 05ccb17c
      Jesse Gross authored
      Flash attention kernels require the KV cache mask to be an F16
      rather than an F32. We can use the GGML operation ggml_cast to do
      this rather than doing it ourselves, which allows reuse of a
      preallocated buffer in the graph rather than allocating a new one
      for each batch. This improves token generation performance with
      flash attention by 10-30% (with gpt-oss). This also makes performance
      with flash attention better than without it, as expected.
      05ccb17c
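      A sketch of the idea with stand-in types (Context, Tensor and the dtype
      constants here are illustrative, not the real ml interfaces): with flash
      attention enabled, the F32 mask is cast to F16 as a graph operation, so a
      preallocated buffer is reused instead of copying into fresh memory each
      batch.

      package maskcast

      // Minimal stand-ins for the backend interfaces; illustrative only.
      type DType int

      const (
          DTypeF32 DType = iota
          DTypeF16
      )

      type Tensor interface{}

      type Context interface {
          Cast(t Tensor, dtype DType) Tensor // wraps ggml_cast as a graph operation
      }

      // maskFor returns the KV cache mask in the dtype the attention kernel
      // expects. With flash attention the cast happens inside the graph, so a
      // preallocated buffer is reused instead of allocating a copy every batch.
      func maskFor(ctx Context, maskF32 Tensor, flashAttention bool) Tensor {
          if flashAttention {
              return ctx.Cast(maskF32, DTypeF16)
          }
          return maskF32
      }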
  4. 04 Aug, 2025 1 commit
    • kvcache: Log contents of cache when unable to find a slot · 0d38b665
      Jesse Gross authored
      There is a bug when using sliding window attention where we run
      out of KV cache slots. This is likely due to not correctly removing
      all of the entries as they slide out of range. This adds additional
      logging when this occurs to track down the source.
      
      Bug #10127
      0d38b665
  5. 31 Jul, 2025 1 commit
    • kvcache: Enable SWA to retain additional entries · 4183bb05
      Jesse Gross authored
      Models that use sliding window attention can only resume a sequence
      from the cache if it falls within the saved windows. This works well
      if the next message picks up where the old one left off. However, it
      generally prevents a partial prefix match unless the entire conversation
      falls within the sliding window.
      
      This can be a problem with reasoning models where the traces are
      supposed to be removed from future messages, forcing the entire
      history to be re-evaluated.
      
      This change allows models to specify that a larger amount of the
      history be retained in memory, allowing partial resumption in more
      cases. It still respects the window that the model was trained on
      for token generation.
      4183bb05
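      A rough sketch of the sizing, with invented names (windowSize, keepExtra):
      eviction uses the window plus an optional amount of extra retained history,
      while masking still uses only the trained window.

      package swaretain

      // Illustrative only: extra history retained beyond the attention window.
      type cache struct {
          windowSize int32 // window the model was trained with (used for masking)
          keepExtra  int32 // additional history kept only so prefixes can be reused
      }

      // evictionBoundary: positions below this may be discarded. The attention
      // mask for token generation still uses windowSize, so outputs are unchanged.
      func (c *cache) evictionBoundary(latestPos int32) int32 {
          return latestPos - (c.windowSize + c.keepExtra)
      }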
  6. 29 Jul, 2025 1 commit
    • kvcache: Don't shift empty batches · c116a752
      Jesse Gross authored
      When we context shift, we delete half the context and apply RoPE
      with an offset to the other half. We used to RoPE across the entire
      context in a single pass with a zero offset for the deleted
      section. With the change to shifting in batches, we can skip any
      batches where all of the offsets would be zero. This typically
      reduces the number of operations by half.
      c116a752
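      A sketch of the combined behavior, with illustrative names (applyShift and
      rope are not the real functions): the shift is applied in batch-sized
      chunks (see the next entry for that change), and chunks whose offsets are
      all zero are skipped entirely.

      package shiftbatch

      // applyShift applies RoPE to cached entries in batch-sized chunks. rope is
      // a stand-in for adding the per-chunk RoPE operation to the compute graph.
      func applyShift(offsets []int32, batchSize int, rope func(chunk []int32)) {
          for start := 0; start < len(offsets); start += batchSize {
              end := min(start+batchSize, len(offsets))
              chunk := offsets[start:end]

              allZero := true
              for _, off := range chunk {
                  if off != 0 {
                      allZero = false
                      break
                  }
              }
              if allZero {
                  continue // deleted or unchanged region: no RoPE call needed
              }
              rope(chunk)
          }
      }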
  7. 25 Jul, 2025 1 commit
    • kvcache: Group shift operations into batches · 764be748
      Jesse Gross authored
      Currently, when we need to do a shift on the cache, it is one
      RoPE operation on the entire size of the cache (per layer). In
      some cases, this can create a compute graph that is larger than
      the forward pass since the forward pass is working in batches.
      Since we don't consider shifting in our memory estimates, it's
      possible for this to cause a crash if we run out of memory.
      
      By limiting the size of the RoPE calls to batch size chunks, we
      ensure that the shift will never exceed the size of the forward
      pass, since the forward pass will also contain a RoPE of the same
      size. This does not have a significant impact on performance since
      RoPE is a math operation that is mostly proportional to the size
      of its inputs.
      
      In theory defrag could have the same issue since it also creates a
      compute graph outside of the forward pass. However, since it only
      performs copies, it does not require any working space.
      764be748
  8. 27 May, 2025 1 commit
    • kvcache: Skip computing causal mask for worst case graph reservation · ea790031
      Jesse Gross authored
      Computing an attention mask for a large context and max batch is
      expensive - over 100ms. Models like Gemma3 that have multiple types
      of caches and custom attention masks need to do this 4 times, so this
      adds approximately 500ms to startup time when using a 128k context.
      
      When we are reserving the worst case graph, we don't need the mask,
      only its shape, so we can skip this.
      ea790031
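      A sketch of the reservation fast path described here, with stand-in types
      (Context, Empty and buildMask are illustrative); note that the first entry
      above later removes this special case again:

      package reservemask

      type Tensor interface{}

      type Context interface {
          Empty(shape ...int) Tensor // allocate a tensor without computing contents
      }

      type cache struct{ batchSize, numCells int }

      func (c *cache) causalMask(ctx Context) Tensor { /* expensive path */ return nil }

      // buildMask returns the attention mask, or just a correctly shaped empty
      // tensor when we are only reserving the worst case graph.
      func (c *cache) buildMask(ctx Context, reserving bool) Tensor {
          if reserving {
              return ctx.Empty(c.batchSize, c.numCells)
          }
          return c.causalMask(ctx)
      }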
  9. 22 May, 2025 1 commit
    • ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This makes things consistent by panicking for the first two cases,
      removing a fair amount of error handling code. This is also consistent
      with how Go typically handles these situations.
      1f371ea9
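      A sketch of the new convention with a simplified stand-in (fromFloatSlice
      here is not the real ml signature): instead of returning an error, the
      constructor panics on a shape mismatch, matching how Empty and Zeros
      already behave.

      package inputalloc

      import "fmt"

      // fromFloatSlice is a simplified stand-in for the changed constructors:
      // instead of returning (Tensor, error), it panics when the shape doesn't
      // match the data, the same way Empty and Zeros already behave.
      func fromFloatSlice(data []float32, shape ...int) []float32 {
          n := 1
          for _, dim := range shape {
              n *= dim
          }
          if n != len(data) {
              panic(fmt.Sprintf("fromFloatSlice: shape %v does not match %d values", shape, len(data)))
          }
          // ... allocate system memory and copy data ...
          return data
      }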
  10. 01 May, 2025 1 commit
  11. 25 Apr, 2025 1 commit
  12. 08 Apr, 2025 1 commit
    • ollamarunner: Preallocate worst case graph at startup · dbb149e6
      Jesse Gross authored
      Currently, the KV cache and graph are lazily allocated as needed.
      The cache is fully allocated on first use of the corresponding
      layer whereas the graph grows with the size of the context.
      
      This can be an issue if another application allocates more VRAM
      after we do our calculations - Ollama will crash in the middle of
      inference. If we instead allocate the maximum needed memory at
      startup of the runner, we will either succeed or fail at that point
      rather than at some surprising time in the future.
      
      Currently, this only generates a worst case batch for text, which
      means that vision models may get a partial allocation and continue
      to lazily allocate the rest.
      dbb149e6
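      A rough sketch of the startup flow, with invented names (worstCaseBatch,
      reserve, initRunner): a maximal text batch is built and its graph reserved
      once at startup, so an allocation failure surfaces immediately rather than
      during inference.

      package reservestartup

      import "fmt"

      type batch struct{ tokens []int32 }

      // worstCaseBatch builds a maximal text batch: batchSize placeholder tokens
      // with positions spanning the full context.
      func worstCaseBatch(batchSize int) batch {
          return batch{tokens: make([]int32, batchSize)}
      }

      // reserve builds (but does not execute) the forward-pass graph for b,
      // forcing the KV cache and graph buffers to their maximum size.
      func reserve(b batch) error { /* backend-specific */ return nil }

      // initRunner fails fast at startup instead of crashing mid-inference if
      // the worst case allocation cannot be satisfied.
      func initRunner(batchSize int) error {
          if err := reserve(worstCaseBatch(batchSize)); err != nil {
              return fmt.Errorf("reserving worst case graph: %w", err)
          }
          return nil
      }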
  13. 02 Apr, 2025 1 commit
    • kvcache: Add check for values that fall out of sliding window cache · b4297006
      jmorganca authored
      The sliding window cache trims entries that are outside the window for
      the latest token. This works when we are extending the cache, such as
      when the conversation continues. However, if we have a partial overlap
      in conversation (including the BOS tokens), then we resume from a past
      point in the conversation and the needed tokens are no longer stored
      in memory. This verifies that the new window overlaps with the old one
      before reusing the cache.
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      b4297006
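      A sketch of the overlap check, with illustrative names and simplified
      bookkeeping: the cached prefix is reused only if the resume point still
      falls within what the sliding window kept around.

      package swaresume

      // canResume reports whether a sequence whose cache holds positions up to
      // cachedLen can be resumed from prefixLen (the length of the matching
      // prefix in the new request), given a window of windowSize tokens. If the
      // resume point predates everything the window retained, the cache must be
      // discarded and the prompt re-evaluated.
      func canResume(cachedLen, prefixLen, windowSize int32) bool {
          oldestRetained := cachedLen - windowSize
          if oldestRetained < 0 {
              oldestRetained = 0
          }
          return prefixLen >= oldestRetained
      }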
  14. 26 Mar, 2025 1 commit
    • kvcache: Sliding window cache only needs a single batch total · 1feff619
      Jesse Gross authored
      When computing the size of the cache for sliding window attention,
      we don't need to multiply the batch size by the number of parallel
      sequences - the batch size is constant.
      
      This also simplifies the check for whether to allocate the cache
      size based on capacity or window size as the batch size is already
      incorporated into the capacity when handled by the runner.
      1feff619
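      The sizing change as a small worked formula (variable names are
      illustrative): the batch term is added once in total rather than once per
      parallel sequence.

      package swasize

      // swaCacheSize returns the number of KV cells needed for sliding window
      // attention: each sequence needs its window, and the in-flight batch needs
      // room once in total, since only one batch is processed at a time.
      func swaCacheSize(windowSize, numSequences, batchSize int) int {
          return windowSize*numSequences + batchSize
      }

      // The previous sizing effectively counted the batch once per sequence:
      //   (windowSize + batchSize) * numSequences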
  15. 21 Mar, 2025 3 commits
    • kvcache: Optimize sliding window attention · 2d6eac90
      Jesse Gross authored
      Currently sliding window attention allocates and uses the full
      context size and just masks out any tokens that are outside of the
      window. However, we really only need (roughly) the sliding window
      size.
      
      At large context sizes this improves two things:
       - Memory allocated - since the full context size was previously allocated
         up front, memory requirements drop substantially. On Gemma3:4b with a 32k
         context window, total memory usage (including weights and non-sliding
         layers) drops from ~20GB to ~8GB.
       - Computation - ranges that are completely outside of the sliding
         window are now removed from the tensors that are returned from the
         cache rather than simply being masked out. This results in more
         efficient processing, scaling with the size of the context that
         has actually been used.
      
      Notably, this does not update the scheduler for any model to be aware of
      the smaller memory requirements. This is difficult for Gemma3 because
      the layers are heterogeneous between sliding and non-sliding attention.
      As a result, while actual memory consumption will be reduced, the
      scheduler will over-estimate the requirements of the model. This means
      that splitting between GPUs or GPUs and CPUs will still be suboptimal.
      
      Bug #9730
      2d6eac90
    • kvcache: Pass granular cache size into implementations · 3ed7ad3a
      Jesse Gross authored
      Currently the runner computes the kv size needed and creates a
      cache of that size. This is the context size times number of
      parallel sequences.
      
      Cache implementations can make better decisions about their memory
      usage, so instead pass in the required capacity, number of sequences
      and maximum batch size. For now, the causal cache just uses this to
      compute the size in the same way as before.
      3ed7ad3a
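      A sketch of the initialization interface described here (the method shape
      is illustrative, not the exact signature): the cache receives the raw
      requirements and decides how much to allocate.

      package cacheinit

      // Cache is a stand-in for the kvcache interface; the exact method shape is
      // illustrative.
      type Cache interface {
          // Init receives the raw requirements instead of a single precomputed size.
          Init(capacity, numSequences, maxBatch int)
      }

      // causal keeps the previous behavior and simply allocates the capacity the
      // runner already computed; other implementations can use the extra
      // information to allocate less.
      type causal struct{ cells int }

      func (c *causal) Init(capacity, numSequences, maxBatch int) {
          c.cells = capacity
      }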
    • kvcache: Account for source tensors in defrag operation count · d3e9ca3e
      Jesse Gross authored
      Defragging the KV cache can generate a lot of operations, so we
      need to be careful that we don't overflow the number that the graph
      can support. We currently account for all of the nodes that we add
      to the graph for each move but we also need to include the original
      cache tensors as well.
      
      Fixes #9904
      d3e9ca3e
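      A sketch of the budgeting arithmetic, with illustrative names: each move
      is charged for the graph nodes it adds plus the source cache tensors it
      references, and the number of moves per defrag graph is capped accordingly.

      package defragbudget

      // movesPerGraph caps how many cache moves fit into one defrag compute graph
      // without overflowing the backend's node limit. The per-move costs are
      // illustrative: each move contributes the nodes it adds plus the original
      // cache tensors it reads from.
      func movesPerGraph(maxGraphNodes, nodesPerMove, sourceTensorsPerMove int) int {
          costPerMove := nodesPerMove + sourceTensorsPerMove
          if costPerMove <= 0 {
              return 0
          }
          return maxGraphNodes / costPerMove
      }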
  16. 20 Mar, 2025 1 commit
  17. 11 Mar, 2025 2 commits
  18. 10 Mar, 2025 1 commit
    • model: Update encoder cache to use multimodal input processing handler · a1cda80b
      Jesse Gross authored
      The encoder cache needs to know the position of images in the input
      stream so that it knows when to delete them. Previously images didn't
      have a position, so we implied one by breaking batches before an
      image and then assuming the image was in the first position. However,
      multimodal objects are now given explicit positions in the input
      stream, so we can use that instead.
      
      Breaking batches was also a way to simulate a cross attention mask
      for mllama. However, given that it only supports a single sequence
      and a single image, this mask doesn't serve any real purpose.
      Removing the batch break does not appear to affect the quality of
      the output.
      
      Most of this is simply moving the input data structures to a new
      package to avoid import cycles.
      a1cda80b
  19. 08 Mar, 2025 2 commits
  20. 07 Mar, 2025 2 commits
  21. 02 Mar, 2025 2 commits
    • ml: Enable support for flash attention · 21aa666a
      Jesse Gross authored
      The GGML flash attention kernel has specific requirements for
      padding and permutation. This adds support to the KV cache
      for conforming to these requirements so that flash attention
      can be enabled.
      
      Flash attention can be used in the same situations as the llama
      engine and is enabled by the user in the same way.
      21aa666a
    • attention: Remove unnecessary contiguous operations · 854a9195
      Jesse Gross authored
      Prior to performing attention, we need to permute query, key
      and value. Currently we call Contiguous after each of these
      permutations, which is correct but expensive. Avoiding the
      3 calls to Contiguous increases performance by over 20%.
      
      The permutations of query and key do not violate the continuity
      rules for mulmat and the Contiguous call can be simply removed.
      
      Value requires a different permutation and does require Contiguous.
      However, we can use the copy into the cache as a way to perform this
      without further overhead.
      
      To support this and avoid unexpected tensor shapes that are seen by
      models, we need tighter integration between attention, cache
      and backend. Future optimization will also likely need this structure
       - for example, flash attention has special padding requirements in
      the cache and other backends may have their own needs.
      
      This further contains the operations that go into attention so that
      these and other optimizations can be handled transparently. Models
      that have special requirements for attention can still implement
      their own version of it.
      854a9195
  22. 27 Feb, 2025 1 commit
    • ml: update Context.Forward interface · 3e8b8a19
      Michael Yang authored
      update Context.Forward to accept multiple tensors to match
      Context.Compute signature
      
      update Context.Forward to return Context such that it can be chained
      with Context.Compute
      3e8b8a19
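      The shape of the updated interface as described above, sketched with
      minimal stand-in types: Forward is variadic and returns the Context, so it
      can be chained with Compute.

      package mlchain

      type Tensor interface{}

      // Context is sketched with minimal stand-in types; only the two methods
      // discussed above are shown.
      type Context interface {
          Forward(tensors ...Tensor) Context // variadic, returns Context for chaining
          Compute(tensors ...Tensor)
      }

      // With the updated interface the two calls can be chained:
      func example(ctx Context, a, b Tensor) {
          ctx.Forward(a, b).Compute(a, b)
      }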
  23. 14 Feb, 2025 1 commit
    • Runner for Ollama engine · ed443a03
      Jesse Gross authored
      This provides integration with the new Ollama engine
      (58245413 next ollama runner (#7913)) and the rest of the Ollama
      infrastructure such as the runner and Ollama server.
      
      In addition, it also builds out the KV cache infrastructure to
      support requirements of how Ollama runs models such as:
       - Parallel processing
       - Memory management for defragmentation and shifting
       - Multi-modal models
      
      Both old and new engines continue to be supported. By default, only
      the old engine is used. To enable the new engine:
      
      Start the server with the OLLAMA_NEW_ENGINE environment variable set:
      OLLAMA_NEW_ENGINE=1 ./ollama serve
      
      Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
      ./ollama run jessegross/llama3.1
      ed443a03