  1. 30 Oct, 2025 2 commits
    • ggml: Enable op_offload to improve partial offload performance · afaf7ce8
      Jesse Gross authored
      When a model is partially offloaded to system RAM, we can either
      do the calculations on the CPU or we can temporarily transfer the
      data to the GPU to do the calculations there. Small batches tend
      to be better on the CPU, large batches on the GPU.
      
      The llamarunner used the GPU in most cases and the ollamarunner
      used the CPU. Although the ollamarunner saw an improvement in
      token generation performance, there was a large performance hit
      in prompt processing (3-10x).
      
      There is an existing heuristic to dynamically switch between these
      two modes, but in practice it doesn't have enough information to
      make that decision accurately. This adds authoritative data so that
      the check works and we get the best of both worlds.
      
      Fixes #12037
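
      The batch-size trade-off described above can be sketched as a simple
      threshold check. This is only an illustration: `pickDevice` and the
      `minOffloadBatch` value are hypothetical and are not taken from the
      commit or from ggml.

```go
package main

import "fmt"

// minOffloadBatch is a hypothetical threshold chosen for illustration only.
const minOffloadBatch = 32

// device names where an op on weights held in system RAM is computed.
type device string

const (
	cpu device = "cpu"
	gpu device = "gpu"
)

// pickDevice keeps small batches on the CPU and temporarily transfers
// large batches to the GPU.
func pickDevice(batchSize int) device {
	if batchSize >= minOffloadBatch {
		return gpu
	}
	return cpu
}

func main() {
	fmt.Println(pickDevice(1))   // token generation: one token per step
	fmt.Println(pickDevice(512)) // prompt processing: a full batch
}
```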
    • ollamarunner: Worst case batch for token generation · 26465fb8
      Jesse Gross authored
      We currently allocate the worst case batch for max sized
      batches, which corresponds to prompt processing. However,
      there are some cases where the generated graph is different
      for small and large batches. To ensure that we don't need
      to allocate memory later after layout has taken place, we
      should run the worst case batch both ways and take the larger
      amount of memory.
      
      This does not noticeably affect loading speed, as the most expensive
      part of this logic is image processing, which does not occur during
      token generation.
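
      A minimal sketch of "run the worst case batch both ways and take the
      larger amount", assuming a hypothetical `reserveFunc` that reports
      how many bytes a graph needs for a given batch size. The cost model
      in `main` is made up.

```go
package main

import "fmt"

// reserveFunc is a hypothetical stand-in for building a graph for the given
// batch size and reporting how many bytes it needs.
type reserveFunc func(batchSize int) uint64

// worstCase sizes the graph for both a max-sized batch (prompt processing)
// and a single-token batch (token generation) and keeps the larger result.
func worstCase(reserve reserveFunc, maxBatch int) uint64 {
	prompt := reserve(maxBatch)
	generation := reserve(1)
	if generation > prompt {
		return generation
	}
	return prompt
}

func main() {
	// Made-up cost model in which small batches take a different graph path
	// that happens to need more memory than the large-batch path.
	reserve := func(batchSize int) uint64 {
		if batchSize < 8 {
			return 4096
		}
		return uint64(batchSize) * 4
	}
	fmt.Println(worstCase(reserve, 512))
}
```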
  2. 29 Oct, 2025 1 commit
  3. 28 Oct, 2025 2 commits
  4. 27 Oct, 2025 1 commit
    • server: Consolidate embedding truncation in runner (#12730) · 5d347f6d
      nicole pardal authored
      Currently, the length of embedding prompts is checked (and the
      prompts possibly truncated) to ensure they fit in the context window
      in two places: the Ollama server and the runner. This can lead to
      inconsistencies in both the checks and the reported number of tokens
      processed. Since we have to do this processing in the runner, this
      consolidates all of the logic there.
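
      As a rough illustration of keeping the check, the truncation, and the
      reported token count in one place, the sketch below uses a
      hypothetical `fitToContext` helper. The `truncate` flag and the error
      value are assumptions, not the runner's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooLong is a hypothetical error for inputs that cannot be truncated.
var errTooLong = errors.New("input exceeds the context window")

// fitToContext returns the tokens that fit in a context of numCtx tokens,
// or an error when the input is too long and truncation is not allowed.
func fitToContext(tokens []int, numCtx int, truncate bool) ([]int, error) {
	if len(tokens) <= numCtx {
		return tokens, nil
	}
	if !truncate {
		return nil, errTooLong
	}
	return tokens[:numCtx], nil
}

func main() {
	tokens := []int{1, 2, 3, 4, 5}
	kept, err := fitToContext(tokens, 3, true)
	// The count reported to the client comes from the same code that truncates.
	fmt.Println(len(kept), err)
}
```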
  5. 20 Oct, 2025 1 commit
  6. 11 Oct, 2025 1 commit
  7. 10 Oct, 2025 1 commit
    • ollamarunner: fix deadlock · 1a2feb2a
      Michael Yang authored
      hardErrCh will deadlock since forwardBatch is blocked on
      computeStartedCh, which never gets sent. Since the response to
      hardErrCh is to panic, just panic instead.
  8. 09 Oct, 2025 3 commits
  9. 01 Oct, 2025 1 commit
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner. This should eliminate inconsistency between our GPU discovery and the
      runner's capabilities at runtime, particularly for cases where we try to filter
      out unsupported GPUs. Now the runner does that implicitly based on the actual
      device list. In some cases free VRAM reporting can be unreliable, which can
      lead to scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries if available.
      
      Automatic workarounds have been removed as only one GPU leveraged this, which
      is now documented. This GPU will soon fall off the support matrix with the next
      ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
  10. 16 Sep, 2025 1 commit
  11. 15 Sep, 2025 1 commit
  12. 12 Sep, 2025 2 commits
  13. 11 Sep, 2025 1 commit
  14. 10 Sep, 2025 1 commit
  15. 08 Sep, 2025 1 commit
  16. 04 Sep, 2025 2 commits
  17. 29 Aug, 2025 1 commit
    • perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy
      
      This refactors the main run loop of the ollama runner to perform the main
      GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next
      batch in parallel, reducing the amount of time the GPU stalls waiting for the
      next batch of work.
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
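
      The overlap described in the perf change can be sketched as follows.
      `prepare` and `compute` are hypothetical stand-ins for graph building
      and the GPU-intensive step; the real run loop is more involved.

```go
package main

import "fmt"

type batch struct{ id int }

// prepare stands in for building the next batch's graph on the CPU.
func prepare(id int) batch { return batch{id: id} }

// compute stands in for the GPU-intensive step.
func compute(b batch) int { return b.id * 2 }

// run processes n batches, preparing batch i+1 while batch i is computing.
func run(n int) []int {
	results := make([]int, 0, n)
	if n <= 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		cur := next
		done := make(chan int, 1)
		go func() { done <- compute(cur) }()
		if i+1 < n {
			next = prepare(i + 1) // overlaps with the compute goroutine
		}
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(run(3))
}
```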
  18. 14 Aug, 2025 1 commit
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
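
      One way to picture "tracking actual allocations and reacting to
      that" is a retry loop that lowers the number of GPU layers until an
      allocation attempt succeeds. The `tryAlloc` callback and the linear
      search are assumptions made for brevity, not the engine's algorithm.

```go
package main

import "fmt"

// fitLayers returns the largest number of layers that tryAlloc accepts on
// the GPU, or 0 if none fit. tryAlloc is a hypothetical callback that
// attempts a real allocation and reports whether it succeeded.
func fitLayers(totalLayers int, tryAlloc func(gpuLayers int) bool) int {
	for n := totalLayers; n > 0; n-- {
		if tryAlloc(n) {
			return n
		}
	}
	return 0
}

func main() {
	// Made-up sizes: each layer needs 100 units, the graph 250, and the
	// GPU has 1000 units free.
	const free, perLayer, graph = 1000, 100, 250
	tryAlloc := func(n int) bool { return n*perLayer+graph <= free }
	fmt.Println(fitLayers(32, tryAlloc))
}
```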
  19. 08 Aug, 2025 1 commit
    • ggml: Support closing backends · 756c78cf
      Jesse Gross authored
      In order to iteratively find the best memory allocation, we need to
      be able to free backend memory so we can try again.
  20. 22 May, 2025 2 commits
    • ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This makes things consistent by panicking for the first two cases,
      removing a fair amount of error handling code. This is also consistent
      with how Go typically handles these situations.
    • ollamarunner: Memory usage reporting · 73d6a82c
      Jesse Gross authored
      This provides granular information about the backend memory allocations
      required by the runner:
       - Per backend
       - Per layer
       - Weights, cache and graph
       - Allocation status
      
      This can be used for debugging and validating memory estimates.
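
      A report with the granularity listed above might be shaped roughly
      like this. All type and field names are hypothetical; only the
      categories (per backend, per layer, weights/cache/graph, allocation
      status) come from the commit message.

```go
package main

import "fmt"

// status records whether an allocation was attempted and succeeded.
type status int

const (
	notAllocated status = iota
	allocated
	failed
)

// region is one allocation: its size in bytes and its status.
type region struct {
	Size   uint64
	Status status
}

// backendMemory groups the allocations made on a single backend.
type backendMemory struct {
	Name    string
	Weights []region // one entry per layer
	Cache   []region // one entry per layer
	Graph   region
}

// total sums every region on the backend, regardless of status.
func (b backendMemory) total() uint64 {
	sum := b.Graph.Size
	for _, r := range b.Weights {
		sum += r.Size
	}
	for _, r := range b.Cache {
		sum += r.Size
	}
	return sum
}

func main() {
	gpu := backendMemory{
		Name:    "gpu0",
		Weights: []region{{Size: 100, Status: allocated}, {Size: 100, Status: allocated}},
		Cache:   []region{{Size: 10, Status: allocated}, {Size: 10, Status: failed}},
		Graph:   region{Size: 50, Status: notAllocated},
	}
	fmt.Println(gpu.Name, gpu.total())
}
```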
  21. 19 May, 2025 1 commit
    • ggml: Seperate tensor load from backend creation · 94ab428e
      Jesse Gross authored
      Currently, when the backend is created, the tensors are loaded at the
      same time, which is a slow operation. This separates them into two
      steps:
       - Create the backend, including enumerating tensors and allocating memory
       - Load the tensor data
      
      This allows more flexibility in managing model loading.
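
      The two steps can be sketched with a toy backend. `newBackend`,
      `load`, and the `read` callback are hypothetical names; the point is
      only that allocation and the slow data read are separate calls.

```go
package main

import "fmt"

// backend holds one buffer per tensor.
type backend struct {
	buffers map[string][]byte
}

// newBackend enumerates tensors and allocates their memory, but reads no
// tensor data.
func newBackend(sizes map[string]int) *backend {
	b := &backend{buffers: make(map[string][]byte, len(sizes))}
	for name, size := range sizes {
		b.buffers[name] = make([]byte, size)
	}
	return b
}

// load fills every allocated buffer using read. This is the slow step and
// can now be scheduled independently of backend creation.
func (b *backend) load(read func(name string, dst []byte) error) error {
	for name, dst := range b.buffers {
		if err := read(name, dst); err != nil {
			return fmt.Errorf("loading %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	b := newBackend(map[string]int{"tok_embd": 4, "output": 2})
	err := b.load(func(name string, dst []byte) error {
		for i := range dst {
			dst[i] = 1 // pretend this came from the model file
		}
		return nil
	})
	fmt.Println(len(b.buffers), err)
}
```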
  22. 15 May, 2025 2 commits
    • ollamarunner: Multi-modal worst case graph · fe623c2c
      Jesse Gross authored
      We currently preallocate compute graph memory for the worst case
      batch of text tokens. This adds support for doing the same for
      images.
      
      Note that image models are more complicated than text models in
      how they process their inputs so there may be cases where this
      approach isn't completely generic for all models. It covers all
      currently supported models though.
    • ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding in multiple
      batches. This can arise if the embedding is larger than the batch size.
      In these cases (as with llama4), we would like to create views that
      are more appropriately sized. However, if we do this then the original
      source tensor is used in multiple graphs, which isn't allowed. To
      avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views. There is no
      longer a single vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
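
      The "compute on first use, then hand out views" pattern can be
      sketched in plain Go. `lazyEmbedding` is a hypothetical type that
      uses slices in place of tensors; it is not the runner's
      implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyEmbedding computes its data once, on first use, and then serves
// batch-sized views of it.
type lazyEmbedding struct {
	once    sync.Once
	compute func() []float32
	data    []float32
}

// view returns elements [start, end) of the embedding, computing the whole
// embedding first if that has not happened yet.
func (e *lazyEmbedding) view(start, end int) []float32 {
	e.once.Do(func() { e.data = e.compute() })
	return e.data[start:end]
}

func main() {
	calls := 0
	emb := &lazyEmbedding{compute: func() []float32 {
		calls++
		return make([]float32, 10)
	}}

	// An embedding of 10 elements split across batches of 4.
	const batchSize = 4
	for start := 0; start < 10; start += batchSize {
		end := start + batchSize
		if end > 10 {
			end = 10
		}
		fmt.Println(len(emb.view(start, end)))
	}
	fmt.Println("compute calls:", calls)
}
```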
  23. 12 May, 2025 1 commit
  24. 05 May, 2025 1 commit
  25. 02 May, 2025 1 commit
    • ollamarunner: Re-enable worst case graph preallocation. · c2f5d666
      Jesse Gross authored
      Worst case graph preallocation was disabled by a27462b7
      "ollamarunner: Temporarily disable worst case graph preallocation"
      since it caused crashes with large batches when not using the GPU.
      
      This backports upstream llama.cpp commit f057808
      "ggml: Don't assert fail when tensor data changes (#13222)", which
      fixes the underlying bug and allows reverting the previous workaround.
  26. 01 May, 2025 1 commit
    • ollamarunner: Fix memory leak when processing images · 8e8f2c6d
      Jesse Gross authored
      The context (and therefore associated input tensors) was not being
      properly closed when images were being processed. We were trying to
      close them but in reality we were closing over an empty list, preventing
      anything from actually being freed.
      
      Fixes #10434
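
      One common way this class of bug arises in Go is a `defer` whose
      argument is evaluated while the list is still empty. The sketch below
      illustrates that general pitfall under that assumption; it is not the
      code that was fixed.

```go
package main

import "fmt"

type ctx struct{ closed bool }

func closeAll(cs []*ctx) {
	for _, c := range cs {
		c.closed = true
	}
}

func countClosed(cs []*ctx) int {
	n := 0
	for _, c := range cs {
		if c.closed {
			n++
		}
	}
	return n
}

// buggy defers closeAll with the slice as it is right now (empty), so the
// contexts appended later are never closed.
func buggy() []*ctx {
	var ctxs []*ctx
	defer closeAll(ctxs) // argument evaluated here, while ctxs is empty
	ctxs = append(ctxs, &ctx{}, &ctx{})
	return ctxs
}

// fixed defers a closure, which reads ctxs when it runs at function exit.
func fixed() []*ctx {
	var ctxs []*ctx
	defer func() { closeAll(ctxs) }()
	ctxs = append(ctxs, &ctx{}, &ctx{})
	return ctxs
}

func main() {
	fmt.Println("buggy closed:", countClosed(buggy()))
	fmt.Println("fixed closed:", countClosed(fixed()))
}
```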
  27. 29 Apr, 2025 1 commit
  28. 24 Apr, 2025 1 commit
  29. 08 Apr, 2025 1 commit
    • ollamarunner: Preallocate worst case graph at startup · dbb149e6
      Jesse Gross authored
      Currently, the KV cache and graph are lazily allocated as needed.
      The cache is fully allocated on first use of the corresponding
      layer whereas the graph grows with the size of the context.
      
      This can be an issue if another application allocates more VRAM
      after we do our calculations - Ollama will crash in the middle of
      inference. If we instead allocate the maximum needed memory at
      startup of the runner, we will either succeed or fail at that point
      rather than at some surprising time in the future.
      
      Currently, this only generates a worst case batch for text, which
      means that vision models may get a partial allocation and continue
      to lazily allocate the rest.
  30. 03 Apr, 2025 1 commit
    • llm: set done reason at server level (#9830) · e53b3cbd
      Bruce MacDonald authored
      No functional change. Many different done reasons can be set at the runner
      level, so rather than obscuring them we should return them to the server
      process and let it choose what to do with the done reason. This separates
      the API concerns from the runner.
  31. 02 Apr, 2025 1 commit
    • ollamarunner: Don't truncate a SameBatch · 493385eb
      Jesse Gross authored
      When truncating inputs to the context window at the beginning of
      a sequence, we remove the minimum amount possible. However, this
      may cause us to truncate to the middle of a set of inputs that
      the model specified should not be split up. To avoid this, we
      need to remove the rest of the partial batch.
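
      The rule can be sketched as extending the cut to the end of any group
      it would split. The `input` type and the meaning given to `SameBatch`
      here (this input plus the next `SameBatch` inputs must stay together)
      are simplifying assumptions, and the sketch drops inputs from the
      front of the slice only.

```go
package main

import "fmt"

// input is a simplified stand-in: SameBatch > 0 means this input and the
// following SameBatch inputs must not be split up.
type input struct {
	Token     int
	SameBatch int
}

// discardCount starts from the minimum number of leading inputs to drop and
// extends the cut so that it never lands inside a group.
func discardCount(inputs []input, minDiscard int) int {
	discard := minDiscard
	for i := 0; i < discard && i < len(inputs); i++ {
		// The group starting at i covers indexes i through i+SameBatch.
		if end := i + inputs[i].SameBatch + 1; end > discard {
			discard = end
		}
	}
	if discard > len(inputs) {
		discard = len(inputs)
	}
	return discard
}

func main() {
	inputs := make([]input, 6)
	inputs[1].SameBatch = 3 // inputs 1 through 4 form one group

	fmt.Println(discardCount(inputs, 1)) // cut falls before the group
	fmt.Println(discardCount(inputs, 3)) // cut would split the group
}
```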
  32. 31 Mar, 2025 1 commit
    • runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear KV cache when shift operation is not supported by model.
      Added KvCacheCanShift() check to handle models that can't perform cache shifts,
      falling back to full cache clear while preserving logical token history to
      maintain expected behavior when context window fills up.
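
      A rough sketch of the fallback, under one reading of the description:
      the logical history is trimmed the same way in both cases, but a cache
      that cannot shift is cleared and the remaining tokens are reprocessed.
      The `sequence` and `kvCache` types are hypothetical; only the
      can-shift check comes from the commit.

```go
package main

import "fmt"

// kvCache is a toy cache holding the tokens it has processed.
type kvCache struct {
	canShift bool
	cells    []int
}

// sequence pairs the logical token history with its cache.
type sequence struct {
	history []int
	cache   kvCache
}

// makeRoom drops the oldest discard tokens from the history. If the cache
// can shift, it drops the same cells; otherwise it clears the cache and
// returns the remaining history, which must be reprocessed.
func (s *sequence) makeRoom(discard int) []int {
	if discard > len(s.history) {
		discard = len(s.history)
	}
	s.history = s.history[discard:]

	if s.cache.canShift {
		n := discard
		if n > len(s.cache.cells) {
			n = len(s.cache.cells)
		}
		s.cache.cells = s.cache.cells[n:]
		return nil
	}
	s.cache.cells = nil
	return s.history
}

func main() {
	s := &sequence{
		history: []int{1, 2, 3, 4},
		cache:   kvCache{canShift: false, cells: []int{1, 2, 3, 4}},
	}
	fmt.Println(s.makeRoom(2), len(s.cache.cells))
}
```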