1. 10 Oct, 2025 1 commit
    • ollamarunner: fix deadlock · 1a2feb2a
      Michael Yang authored
      Sending on hardErrCh will deadlock because forwardBatch is blocked on
      computeStartedCh, which is never sent to. Since the response to
      hardErrCh is to panic anyway, just panic directly instead.
  2. 09 Oct, 2025 3 commits
  3. 01 Oct, 2025 1 commit
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner. This should eliminate inconsistency between our GPU discovery and the
      runner's capabilities at runtime, particularly for cases where we try to filter
      out unsupported GPUs. Now the runner does that implicitly based on the actual
      device list. In some cases free VRAM reporting can be unreliable, which can
      lead to scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries if available.
      
      Automatic workarounds have been removed as only one GPU leveraged this, which
      is now documented. This GPU will soon fall off the support matrix with the next
      ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
  4. 17 Sep, 2025 1 commit
  5. 16 Sep, 2025 1 commit
  6. 15 Sep, 2025 1 commit
  7. 12 Sep, 2025 2 commits
  8. 11 Sep, 2025 1 commit
  9. 10 Sep, 2025 1 commit
  10. 09 Sep, 2025 1 commit
    • llm: Clamp batch size to context size · e119783e
      Jesse Gross authored
      The context must always be able to store the current batch, so
      if the user requests a small context then we should also shrink
      the batch to match. This also fixes the TestLongInputContext
      test on the new engine. (The old engine already has this behavior.)
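A minimal sketch of the clamping rule described in this commit, assuming a hypothetical helper `clampBatchSize` and illustrative parameter names (neither is taken from the Ollama source):

```go
package main

import "fmt"

// clampBatchSize returns the batch size to use for a given context size.
// The function and parameter names are illustrative assumptions.
func clampBatchSize(batchSize, contextSize int) int {
	// The context must be able to hold a whole batch, so never exceed it.
	if batchSize > contextSize {
		return contextSize
	}
	return batchSize
}

func main() {
	fmt.Println(clampBatchSize(512, 128))  // small context: batch shrinks to 128
	fmt.Println(clampBatchSize(512, 4096)) // large context: batch stays at 512
}
```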
  11. 08 Sep, 2025 2 commits
  12. 04 Sep, 2025 2 commits
  13. 29 Aug, 2025 1 commit
    • perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy
      
      This refactors the main run loop of the ollama runner to perform the main
      GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next
      batch in parallel, reducing the amount of time the GPU stalls waiting for the
      next batch of work.
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
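The overlap described in this commit can be sketched roughly as follows. This is not the runner's actual loop: `prepare` and `compute` are hypothetical stand-ins for graph building and GPU work, and the real code coordinates through more channels than shown here.

```go
package main

import "fmt"

type batch struct{ id int }

// prepare stands in for building the graph of a batch (CPU-side work).
func prepare(id int) batch { return batch{id: id} }

// compute stands in for the GPU-intensive step.
func compute(b batch) int { return b.id * 10 }

// runPipelined computes batch i in a goroutine while batch i+1 is prepared,
// so the two steps overlap instead of running strictly back to back.
func runPipelined(n int) []int {
	results := make([]int, 0, n)
	if n == 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		cur := next
		done := make(chan int, 1)
		go func() { done <- compute(cur) }()
		if i+1 < n {
			next = prepare(i + 1) // overlaps with the compute goroutine
		}
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3)) // [0 10 20]
}
```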
  14. 22 Aug, 2025 1 commit
  15. 14 Aug, 2025 1 commit
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
  16. 08 Aug, 2025 1 commit
    • ggml: Support closing backends · 756c78cf
      Jesse Gross authored
      In order to iteratively find the best memory allocation, we need to
      be able to free backend memory so we can try again.
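To illustrate why an iterative search needs a way to free backend memory, here is a small simulation. Everything in it (`vram`, `backend`, `fitLayers`, the unit sizes) is a hypothetical model, not the ggml backend API.

```go
package main

import "fmt"

// vram is a simulated pool of free device memory, in arbitrary units.
type vram struct{ free int }

// backend is a hypothetical stand-in for a compute backend that holds
// memory until it is closed.
type backend struct {
	pool *vram
	held int
}

// allocate tries to reserve memory for the requested number of layers.
// On failure the backend keeps whatever it had already reserved.
func (b *backend) allocate(layers, perLayer int) bool {
	for i := 0; i < layers; i++ {
		if b.pool.free < perLayer {
			return false
		}
		b.pool.free -= perLayer
		b.held += perLayer
	}
	return true
}

// Close returns everything the backend holds to the pool.
func (b *backend) Close() {
	b.pool.free += b.held
	b.held = 0
}

// fitLayers searches downward from maxLayers for the largest layer count
// that fits, closing each failed backend so the next attempt starts clean.
func fitLayers(pool *vram, maxLayers, perLayer int) (int, *backend) {
	for n := maxLayers; n > 0; n-- {
		b := &backend{pool: pool}
		if b.allocate(n, perLayer) {
			return n, b
		}
		b.Close() // without this, partial allocations would starve later tries
	}
	return 0, &backend{pool: pool}
}

func main() {
	pool := &vram{free: 100}
	n, b := fitLayers(pool, 8, 30)
	fmt.Println(n, pool.free) // 3 10
	b.Close()
	fmt.Println(pool.free) // 100
}
```

If `Close` were omitted in the loop, the first failed attempt would leave the pool nearly empty and the search would wrongly settle on zero layers.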
  17. 22 May, 2025 2 commits
    • ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This makes things consistent by panicking for the first two cases,
      removing a fair amount of error handling code. This is also consistent
      with how Go typically handles these situations.
    • ollamarunner: Memory usage reporting · 73d6a82c
      Jesse Gross authored
      This provides granular information about the backend memory allocations
      required by the runner:
       - Per backend
       - Per layer
       - Weights, cache and graph
       - Allocation status
      
      This can be used for debugging and validating memory estimates.
  18. 19 May, 2025 1 commit
    • ggml: Separate tensor load from backend creation · 94ab428e
      Jesse Gross authored
      Currently, when the backend is created, the tensors are loaded at the
      same time, which is a slow operation. This separates them into two
      steps:
       - Creating the backend, including enumerating tensors and allocating memory
       - Loading tensor data
      
      This allows more flexibility in managing model loading.
  19. 15 May, 2025 3 commits
    • ollamarunner: Multi-modal worst case graph · fe623c2c
      Jesse Gross authored
      We currently preallocate compute graph memory for the worst case
      batch of text tokens. This adds support for doing the same for
      images.
      
      Note that image models are more complicated than text models in
      how they process their inputs so there may be cases where this
      approach isn't completely generic for all models. It covers all
      currently supported models though.
    • ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding in multiple
      batches. This can arise if the embedding is larger than the batch size.
      In these cases (as with llama4), we would like to create views that
      are more appropriately sized. However, if we do this then the original
      source tensor is used in multiple graphs, which isn't allowed. To
      avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views. There is no
      longer a single vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
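The "compute on first use, then hand out views" pattern from this commit can be approximated with `sync.Once`. The `lazyEmbedding` type is a hypothetical illustration using plain slices rather than tensors, so it does not capture the graph-ownership constraints the commit describes.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyEmbedding computes its data once, on first use, and then serves
// slices of it. It is an illustrative stand-in for an embedding tensor.
type lazyEmbedding struct {
	once    sync.Once
	compute func() []float32
	data    []float32
}

// view returns n rows starting at offset, computing the embedding if needed.
func (e *lazyEmbedding) view(offset, n int) []float32 {
	e.once.Do(func() { e.data = e.compute() })
	return e.data[offset : offset+n]
}

func main() {
	calls := 0
	emb := &lazyEmbedding{compute: func() []float32 {
		calls++
		return []float32{0, 1, 2, 3, 4, 5}
	}}
	// An embedding of 6 rows consumed by two batches of at most 4 rows.
	first := emb.view(0, 4)
	second := emb.view(4, 2)
	fmt.Println(first, second, calls) // [0 1 2 3] [4 5] 1
}
```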
    • ollamarunner: Base cached tokens on current prompt · 499ae731
      Jesse Gross authored
      When we restore a sequence from the cache, we split the prompt into
      the already used tokens (stored in the cache) and new tokens that
      need to be processed. Currently, the references to the used tokens
      are coming from the stored previous sequence.
      
      However, even though we know that the used tokens are semantically
      equivalent to the prefix of the prompt, tokens can contain pointers
      which are no longer valid. As a result, it is better to get the
      used tokens from the prompt, which has currently valid pointers.
      
      This doesn't currently have any impact because it isn't possible
      to reuse the pointers (which are tensors) anyway. However, it
      becomes an issue once we can.
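A sketch of taking the already-used tokens from the current prompt rather than from the stored sequence. The `input` type, its `ref` field (standing in for a tensor pointer), and `splitPrompt` are assumptions made for illustration.

```go
package main

import "fmt"

// input is an illustrative token; ref stands in for a tensor pointer that
// may be stale in a previously stored sequence.
type input struct {
	token int
	ref   *int
}

// splitPrompt finds the prefix the prompt shares with the cached sequence
// and returns both halves as slices of the prompt, so every returned input
// carries the prompt's currently valid pointers.
func splitPrompt(cached, prompt []input) (used, fresh []input) {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n].token == prompt[n].token {
		n++
	}
	return prompt[:n], prompt[n:]
}

func main() {
	stale, valid := new(int), new(int)
	cached := []input{{token: 1}, {token: 2, ref: stale}, {token: 3}}
	prompt := []input{{token: 1}, {token: 2, ref: valid}, {token: 3}, {token: 4}}

	used, fresh := splitPrompt(cached, prompt)
	fmt.Println(len(used), len(fresh), used[1].ref == valid) // 3 1 true
}
```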
  20. 12 May, 2025 1 commit
  21. 08 May, 2025 1 commit
    • ollamarunner: Use correct constant to remove cache entries · 3d9498a4
      Jesse Gross authored
      The correct constant to remove all entries to the end of the sequence
      for the Ollama engine is math.MaxInt32. -1 is used by the old engine.
      
      The impact of this is currently minimal because it would only occur
      in situations that are not supported by the implemented models or
      rarely used options.
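To show why the sentinel matters, here is a toy range removal over cache positions. `removeRange` is a hypothetical helper, and the half-open `[begin, end)` convention is an assumption. Only the two constants come from the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// removeRange drops cache positions in the half-open range [begin, end).
// With end == math.MaxInt32 it removes everything from begin onward.
func removeRange(positions []int32, begin, end int32) []int32 {
	kept := make([]int32, 0, len(positions))
	for _, p := range positions {
		if p < begin || p >= end {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	positions := []int32{0, 1, 2, 3, 4, 5}
	fmt.Println(removeRange(positions, 3, math.MaxInt32)) // [0 1 2]
	// In this sketch, the old engine's -1 sentinel silently removes nothing.
	fmt.Println(removeRange(positions, 3, -1)) // [0 1 2 3 4 5]
}
```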
  22. 05 May, 2025 1 commit
  23. 02 May, 2025 1 commit
    • ollamarunner: Re-enable worst case graph preallocation. · c2f5d666
      Jesse Gross authored
      Worst case graph preallocation was disabled by a27462b7
      "ollamarunner: Temporarily disable worst case graph preallocation"
      since it caused crashes with large batches when not using the GPU.
      
      This backports upstream llama.cpp commit f057808
      "ggml: Don't assert fail when tensor data changes (#13222)", which
      fixes the underlying bug and allows reverting the previous workaround.
  24. 01 May, 2025 1 commit
    • ollamarunner: Fix memory leak when processing images · 8e8f2c6d
      Jesse Gross authored
      The context (and therefore associated input tensors) was not being
      properly closed when images were being processed. We were trying to
      close them but in reality we were closing over an empty list, preventing
      anything from actually being freed.
      
      Fixes #10434
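One common way to end up cleaning up an empty list in Go is passing the slice directly to a deferred call, since deferred arguments are evaluated at the `defer` statement. Whether this is the exact form the bug took is an assumption; the sketch below only illustrates the general pitfall with a hypothetical `resource` type.

```go
package main

import "fmt"

// resource is a stand-in for a context that must be closed to free memory.
type resource struct{ closed bool }

func (r *resource) Close() { r.closed = true }

func closeAll(rs []*resource) {
	for _, r := range rs {
		r.Close()
	}
}

// leaky defers closeAll with the slice as it is at defer time: empty.
func leaky() []*resource {
	var rs []*resource
	defer closeAll(rs) // argument evaluated now, so nothing is ever closed
	for i := 0; i < 3; i++ {
		rs = append(rs, &resource{})
	}
	return rs
}

// fixed defers a closure, which reads rs when the function returns.
func fixed() []*resource {
	var rs []*resource
	defer func() { closeAll(rs) }()
	for i := 0; i < 3; i++ {
		rs = append(rs, &resource{})
	}
	return rs
}

func countClosed(rs []*resource) int {
	n := 0
	for _, r := range rs {
		if r.closed {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countClosed(leaky()), countClosed(fixed())) // 0 3
}
```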
  25. 29 Apr, 2025 1 commit
  26. 24 Apr, 2025 1 commit
  27. 08 Apr, 2025 1 commit
    • ollamarunner: Preallocate worst case graph at startup · dbb149e6
      Jesse Gross authored
      Currently, the KV cache and graph are lazily allocated as needed.
      The cache is fully allocated on first use of the corresponding
      layer whereas the graph grows with the size of the context.
      
      This can be an issue if another application allocates more VRAM
      after we do our calculations - Ollama will crash in the middle of
      inference. If we instead allocate the maximum needed memory at
      startup of the runner, we will either succeed or fail at that point
      rather than at some surprising time in the future.
      
      Currently, this only generates a worst case batch for text, which
      means that vision models may get a partial allocation and continue
      to lazily allocate the rest.
  28. 03 Apr, 2025 1 commit
    • llm: set done reason at server level (#9830) · e53b3cbd
      Bruce MacDonald authored
      No functional change. Many different done reasons can be set at the runner
      level, so rather than obscuring them we should return them to the server
      process and let it choose what to do with the done reason. This separates
      the API concerns from the runner.
  29. 02 Apr, 2025 2 commits
    • kvcache: Add check for values that fall out of sliding window cache · b4297006
      jmorganca authored
      The sliding window cache trims entries that are outside the window for
      the latest token. This works when we are extending the cache, such as
      when the conversation continues. However, if we have a partial overlap
      in conversation (including the BOS tokens), then we resume from a past
      point in the conversation and the needed tokens are no longer stored
      in memory. This verifies that the new window overlaps with the old one
      before reusing the cache.
      Co-authored-by: Jesse Gross <jesse@ollama.com>
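A minimal sketch of the overlap check, assuming positions are zero-based and that the cache keeps roughly the last `window` positions before the latest token. The function name and its exact boundary conventions are assumptions, not the kvcache package's code.

```go
package main

import "fmt"

func max32(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

// canResume reports whether a sequence can resume at position pos when the
// cache last stored position last and keeps a sliding window of size window.
// Resuming is safe only if the window needed at pos starts no earlier than
// the window that is still stored.
func canResume(last, pos, window int32) bool {
	lastWindowStart := max32(0, last-window)
	posWindowStart := max32(0, pos-window)
	return posWindowStart >= lastWindowStart
}

func main() {
	fmt.Println(canResume(10, 11, 4)) // true: extending the conversation
	fmt.Println(canResume(10, 5, 4))  // false: needed entries were trimmed
	fmt.Println(canResume(3, 2, 4))   // true: nothing has been trimmed yet
}
```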
    • ollamarunner: Don't truncate a SameBatch · 493385eb
      Jesse Gross authored
      When truncating inputs to the context window at the beginning of
      a sequence, we remove the minimum amount possible. However, this
      may cause us to truncate to the middle of a set of inputs that
      the model specified should not be split up. To avoid this, we
      need to remove the rest of the partial batch.
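The adjustment can be sketched as moving the cut point forward until it no longer lands inside a group. The `input` type, the meaning given to `sameBatch` (the number of following inputs that must stay with this one), and `alignCut` are illustrative assumptions.

```go
package main

import "fmt"

// input is an illustrative token. sameBatch is assumed to be the number of
// inputs after this one that must be processed in the same batch as it.
type input struct {
	token     int
	sameBatch int
}

// alignCut takes the number of leading inputs to discard and extends it so
// that the cut never splits a same-batch group.
func alignCut(inputs []input, cut int) int {
	for i := 0; i < cut && i < len(inputs); i++ {
		// A group headed at i covers indexes i through i+sameBatch.
		if groupEnd := i + inputs[i].sameBatch + 1; groupEnd > cut {
			cut = groupEnd
		}
	}
	if cut > len(inputs) {
		cut = len(inputs)
	}
	return cut
}

func main() {
	// Index 2 heads a group that also covers indexes 3, 4 and 5.
	inputs := []input{{1, 0}, {2, 0}, {3, 3}, {4, 0}, {5, 0}, {6, 0}, {7, 0}, {8, 0}}
	fmt.Println(alignCut(inputs, 2)) // 2: cut falls before the group
	fmt.Println(alignCut(inputs, 4)) // 6: cut moved past the whole group
}
```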
  30. 31 Mar, 2025 2 commits
    • runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear the KV cache when the model does not support the shift operation.
      This adds a KvCacheCanShift() check to handle models that can't perform cache
      shifts, falling back to a full cache clear while preserving the logical token
      history to maintain expected behavior when the context window fills up.
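A rough model of the fallback, assuming a hypothetical `sequence` type that tracks the logical token history separately from how many of those tokens are currently in the KV cache. Only the KvCacheCanShift() name comes from the commit; here it is reduced to a boolean parameter.

```go
package main

import "fmt"

// sequence is an illustrative model: history is the logical token history
// and cached is how many of its tokens are currently held in the KV cache.
type sequence struct {
	history []int
	cached  int
}

// makeRoom discards `discard` tokens after the first `keep` tokens. If the
// cache can shift, the remaining entries stay cached. Otherwise the cache is
// cleared and the preserved history has to be reprocessed.
func makeRoom(s *sequence, keep, discard int, canShift bool) {
	s.history = append(s.history[:keep:keep], s.history[keep+discard:]...)
	if canShift {
		s.cached -= discard
	} else {
		s.cached = 0
	}
}

func main() {
	a := &sequence{history: []int{0, 1, 2, 3, 4, 5, 6, 7}, cached: 8}
	makeRoom(a, 2, 3, true)
	fmt.Println(a.history, a.cached) // [0 1 5 6 7] 5

	b := &sequence{history: []int{0, 1, 2, 3, 4, 5, 6, 7}, cached: 8}
	makeRoom(b, 2, 3, false)
	fmt.Println(b.history, b.cached) // [0 1 5 6 7] 0
}
```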
    • runner: Release semaphore and improve error messages on failures · b2a46529
      Jesse Gross authored
      If we have an error after creating a new sequence but before
      finding a slot for it, we return without releasing the semaphore.
      This reduces our parallel sequences and eventually leads to deadlock.
      
      In practice this should never happen because once we have acquired
      the semaphore, we should always be able to find a slot. However, the
      code is clearly not correct.
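The leak described in this commit can be illustrated with a buffered channel used as a counting semaphore. The `server` type and `handle` method are hypothetical, and the runner's actual semaphore type may differ.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errBusy  = errors.New("no free slot")
	errSetup = errors.New("sequence setup failed")
)

// server limits parallel sequences with a buffered channel semaphore.
type server struct {
	sem chan struct{}
}

// tryAcquire takes a slot without blocking.
func (s *server) tryAcquire() bool {
	select {
	case s.sem <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s *server) release() { <-s.sem }

// handle acquires a slot and must release it on every exit path.
func (s *server) handle(failSetup bool) error {
	if !s.tryAcquire() {
		return errBusy
	}
	if failSetup {
		s.release() // omit this and each failure permanently consumes a slot
		return errSetup
	}
	defer s.release()
	return nil
}

func main() {
	s := &server{sem: make(chan struct{}, 1)}
	fmt.Println(s.handle(true))  // sequence setup failed
	fmt.Println(s.handle(false)) // <nil>
}
```

With a blocking acquire, as in a real server, the leaked slots would eventually make every new request wait forever, which is the deadlock the commit message mentions.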