1. 10 Oct, 2025 1 commit
    • Michael Yang's avatar
      ollamarunner: fix deadlock · 1a2feb2a
      Michael Yang authored
      hardErrCh will deadlock since forwardBatch is blocked on
      computeStartedCh which never gets sent. since the response to
      hardErrCh is to panic, just panic instead
      1a2feb2a
  2. 09 Oct, 2025 3 commits
  3. 01 Oct, 2025 1 commit
    • Daniel Hiltgen's avatar
      Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner.  This should eliminate inconsistency between our GPU discovery and the
      runners capabilities at runtime, particularly for cases where we try to filter
      out unsupported GPUs.  Now the runner does that implicitly based on the actual
      device list.  In some cases free VRAM reporting can be unreliable which can
      leaad to scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries if available.
      
      Automatic workarounds have been removed as only one GPU leveraged this, which
      is now documented. This GPU will soon fall off the support matrix with the next
      ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
      bc8909fb
  4. 16 Sep, 2025 1 commit
  5. 15 Sep, 2025 1 commit
  6. 12 Sep, 2025 2 commits
  7. 11 Sep, 2025 1 commit
  8. 10 Sep, 2025 1 commit
  9. 08 Sep, 2025 1 commit
  10. 04 Sep, 2025 2 commits
  11. 29 Aug, 2025 1 commit
    • Daniel Hiltgen's avatar
      perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy
      
      This refactors the main run loop of the ollama runner to perform the main GPU
      intensive tasks (Compute+Floats) in a go routine so we can prepare the next
      batch in parallel to reduce the amount of time the GPU stalls waiting for the
      next batch of work.
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
      517807cd
  12. 14 Aug, 2025 1 commit
    • Jesse Gross's avatar
      llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
      d5a0d8d9
  13. 08 Aug, 2025 1 commit
    • Jesse Gross's avatar
      ggml: Support closing backends · 756c78cf
      Jesse Gross authored
      In order to iteratively find the best memory allocation, we need to
      be able to free backend memory so we can try again.
      756c78cf
  14. 22 May, 2025 2 commits
    • Jesse Gross's avatar
      ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This makes things consistent by panicing for the first two cases,
      removing a fair amount of error handling code. This is also consistent
      with how Go typically handles these situations.
      1f371ea9
    • Jesse Gross's avatar
      ollamarunner: Memory usage reporting · 73d6a82c
      Jesse Gross authored
      This provides granular information about the backend memory allocations
      required by the runner:
       - Per backend
       - Per layer
       - Weights, cache and graph
       - Allocation status
      
      This can be used for debugging and validating memory estimates.
      73d6a82c
  15. 19 May, 2025 1 commit
    • Jesse Gross's avatar
      ggml: Seperate tensor load from backend creation · 94ab428e
      Jesse Gross authored
      Currently, when the backend is created, the tensors are loaded at the
      same time, which is a slow operation. This separates them to be two
      steps:
       - Create backend, including enumerating tensors and memory allocation
       - Loading tensor data
      
      This allows more flexibility in managing model loading.
      94ab428e
  16. 15 May, 2025 2 commits
    • Jesse Gross's avatar
      ollamarunner: Multi-modal worst case graph · fe623c2c
      Jesse Gross authored
      We currently preallocate compute graph memory for the worst case
      batch of text tokens. This adds support for doing the same for
      images.
      
      Note that image models are more complicated than text models in
      how they process their inputs so there may be cases where this
      approach isn't completely generic for all models. It covers all
      currently supported models though.
      fe623c2c
    • Jesse Gross's avatar
      ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding in multiple
      batches. This can arise if the embedding is larger than the batch size.
      In these cases (as with llama4), we would like to create views that
      are more appropriately sized. However, if we do this then the original
      source tensor is used in multiple graphs, which isn't allowed. To
      avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views. There is no
      longer a single vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
      3c14461d
  17. 12 May, 2025 1 commit
  18. 05 May, 2025 1 commit
  19. 02 May, 2025 1 commit
    • Jesse Gross's avatar
      ollamarunner: Re-enable worst case graph preallocation. · c2f5d666
      Jesse Gross authored
      Worst case graph preallocation was disabled by a27462b7
      "ollamarunner: Temporarily disable worst case graph preallocation"
      since it caused crashes with large batches when not using the GPU.
      
      This backports upstream llama.cpp commit f057808
      "ggml: Don't assert fail when tensor data changes (#13222)", which
      fixes the underlying bug and allows reverting the previous workaround.
      c2f5d666
  20. 01 May, 2025 1 commit
    • Jesse Gross's avatar
      ollamarunner: Fix memory leak when processing images · 8e8f2c6d
      Jesse Gross authored
      The context (and therefore associated input tensors) was not being
      properly closed when images were being processed. We were trying to
      close them but in reality we were closing over an empty list, preventing
      anything from actually being freed.
      
      Fixes #10434
      8e8f2c6d
  21. 29 Apr, 2025 1 commit
  22. 24 Apr, 2025 1 commit
  23. 08 Apr, 2025 1 commit
    • Jesse Gross's avatar
      ollamarunner: Preallocate worst case graph at startup · dbb149e6
      Jesse Gross authored
      Currently, the KV cache and graph are lazily allocated as needed.
      The cache is fully allocated on first use of the corresponding
      layer whereas the graph grows with the size of the context.
      
      This can be an issue if another application allocates more VRAM
      after we do our calculations - Ollama will crash in the middle of
      inference. If we instead allocate the maximum needed memory at
      startup of the runner, we will either succeed or fail at that point
      rather than at some surprising time in the future.
      
      Currently, this only generates a worst case batch for text, which
      means that vision models may get a partial allocation and continue
      to lazily allocate the rest.
      dbb149e6
  24. 03 Apr, 2025 1 commit
    • Bruce MacDonald's avatar
      llm: set done reason at server level (#9830) · e53b3cbd
      Bruce MacDonald authored
      No functional change. Many different done reasons can be set at the runner
      level, so rather than obsuring them we should return them to the server
      process and let it choose what to do with the done reason. This separates
      the API concerns from the runner.
      e53b3cbd
  25. 02 Apr, 2025 1 commit
    • Jesse Gross's avatar
      ollamarunner: Don't truncate a SameBatch · 493385eb
      Jesse Gross authored
      When truncating inputs to the the context window at the beginning of
      a sequence, we remove the minimum amount possible. However, this
      may cause us to truncate to the middle of a set of inputs that
      the model specified should not be split up. To avoid this, we
      need to remove the rest of the partial batch.
      493385eb
  26. 31 Mar, 2025 3 commits
    • Bruce MacDonald's avatar
      runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear KV cache when shift operation is not supported by model.
      Added KvCacheCanShift() check to handle models that can't perform cache shifts,
      falling back to full cache clear while preserving logical token history to
      maintain expected behavior when context window fills up.
      66b25392
    • Jesse Gross's avatar
      runner: Release semaphore and improve error messages on failures · b2a46529
      Jesse Gross authored
      If we have an error after creating a new sequence but before
      finding a slot for it, we return without releasing the semaphore.
      This reduces our parallel sequences and eventually leads to deadlock.
      
      In practice this should never happen because once we have acquired
      the semaphore, we should always be able to find a slot. However, the
      code is clearly not correct.
      b2a46529
    • Jesse Gross's avatar
      ollamarunner: Ensure batch size limits are not exceeded · 5d097277
      Jesse Gross authored
      With the llama runner, we can generate up to NUM_PARALLEL batches
      at once, which will then get broken up to into individual batches
      to get executed by llama.cpp (i.e. we add up to 2048 tokens and
      this gets split into 4 batches of 512 tokens at default settings).
      
      This splitting can improve parallelism on multi-GPU systems because
      the individual batches can move though the pipeline without blocking
      on the first one to fully complete. However, we don't yet support
      this in the Ollama runner, partially because it makes it hard to
      enforce model-specified batch constraints, which didn't exist
      previously.
      
      The result is that we will try to execute the full, unsplit batch.
      This could result in out of memory or insufficient KV cache space
      errors.
      
      This triggers batch breaking when the total inputs from all sequences
      exceeds the batch size, rather than per-sequence. In order to ensure
      fairness, it also reintroduces round-robinning around sequences so
      that we don't let one busy sequence starve the others.
      5d097277
  27. 21 Mar, 2025 3 commits
    • Michael Yang's avatar
      74bd0965
    • Jesse Gross's avatar
      kvcache: Pass granular cache size into implementations · 3ed7ad3a
      Jesse Gross authored
      Currently the runner computes the kv size needed and creates a
      cache of that size. This is the context size times number of
      parallel sequences.
      
      Cache implementations can make better decisions about their memory
      usage, so instead pass in the required capacity, number of sequences
      and maximum batch size. For now, the causal cache just uses this to
      compute the size in the same way as before.
      3ed7ad3a
    • Jesse Gross's avatar
      ollamarunner: Provide mechanism for backends to report loading progress · 0ff28758
      Jesse Gross authored
      This enables the runner to report progress back to the Ollama server,
      both for showing status to the user and also to prevent the server
      from killing the runner if it thinks things have stalled.
      
      Most of the infrastructure was already there, this extends it to
      be available to the backends.
      0ff28758
  28. 20 Mar, 2025 2 commits
    • Jesse Gross's avatar
      model: Pass input tensor instead of raw data to models · 0fbfcf3c
      Jesse Gross authored
      Rather than directly giving the input data to models, we can
      pass a tensor instead. In the short term, this saves some duplicated
      code.
      
      Longer term, we will want to overlap setting up the next batch with
      processing of the current one. In this case, we will only have the
      shape of tensor but it will not be loaded with data at the time of
      graph generation. By passing only a tensor to models now, we set up
      this possibility and prevent them from relying on data that they won't
      have in the future.
      
      Although the same could be done for Positions and Outputs, in some
      cases we either need the raw input data or don't use them at all.
      Therefore, for now we leave them as they are and allow models to
      convert them to tensors as needed.
      0fbfcf3c
    • Jesse Gross's avatar
      input: Rename Options to Batch · 0c220935
      Jesse Gross authored
      Options is no longer very descriptive of this struct.
      0c220935
  29. 17 Mar, 2025 1 commit
    • Jesse Gross's avatar
      ollamarunner: Check for minBatch of context space when shifting · bf24498b
      Jesse Gross authored
      Models can specify that a group of inputs need to be handled a single
      batch. However, context shifting didn't respect this and could trigger
      a break anyways. In this case, we should instead trigger a context
      shift earlier so that it occurs before the grouped batch.
      
      Note that there still some corner cases:
       - A long prompt that exceeds the context window can get truncated
         in the middle of an image. With the current models, this will
         result in the model not recognizing the image at all, which is
         pretty much the expected result with truncation.
       - The context window is set less than the minimum batch size. The
         only solution to this is to refuse to load the model with these
         settings. However, this can never occur with current models and
         default settings.
      
      Since users are unlikely to run into these scenarios, fixing them is
      left as a follow up.
      bf24498b