1. 01 Oct, 2025 1 commit
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner.  This should eliminate inconsistency between our GPU discovery and the
      runners' capabilities at runtime, particularly for cases where we try to filter
      out unsupported GPUs.  Now the runner does that implicitly based on the actual
      device list.  In some cases free VRAM reporting can be unreliable, which can
      lead to scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries if available.
      
      Automatic workarounds have been removed, as only one GPU relied on them; that
      workaround is now documented. This GPU will soon fall off the support matrix
      with the next ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
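
      A hedged Go sketch of the new shape, using hypothetical names
      (DeviceInfo, Runner.Devices), not the actual API: the scheduler asks
      the runner which devices it can drive instead of probing and
      filtering on its own.

        // DeviceInfo is a hypothetical summary of one device as a runner sees it.
        type DeviceInfo struct {
                ID        string
                Name      string
                TotalVRAM uint64 // bytes
                FreeVRAM  uint64 // bytes, ideally from a dedicated reporting library
        }

        // Runner is a hypothetical handle to a runner subprocess.
        type Runner interface {
                // Devices returns only the devices the runner can actually use,
                // so unsupported GPUs are filtered out implicitly.
                Devices() ([]DeviceInfo, error)
        }

        // discoverGPUs delegates discovery to the runner, keeping the
        // scheduler's view consistent with the runner's capabilities.
        func discoverGPUs(r Runner) ([]DeviceInfo, error) {
                return r.Devices()
        }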
  2. 30 Sep, 2025 1 commit
    • ggml: Remove allocation status reporting · 734b57da
      Jesse Gross authored
      For each memory allocation we report the size of the (attempted)
      allocation and whether it succeeded or failed. The latter status
      reporting proved not very useful in practice, as systems
      such as Windows can automatically overflow from VRAM into RAM,
      resulting in successful allocations even when there isn't
      enough memory where we wanted it.
      
      As a result, this information is only used for debug logging,
      which isn't worthwhile enough for the amount of code. It
      also isn't fully accurate, as multiple allocations may result
      in partial failures.
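
      A small Go sketch of the slimmed-down report (the type name is an
      assumption): only the size of each attempted allocation survives,
      since a success flag is misleading once VRAM can spill into RAM.

        // allocReport records one attempted allocation.
        type allocReport struct {
                Size uint64 // bytes requested
                // OK bool  // removed: unreliable on systems that overflow VRAM to RAM
        }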
  3. 12 Sep, 2025 2 commits
  4. 11 Sep, 2025 2 commits
    • llm: Don't try to load split vision models in the Ollama engine · aba15753
      Jesse Gross authored
      If a model with a split vision projector is loaded in the Ollama
      engine, the projector will be ignored and the model will hallucinate
      a response. Instead, fall back and try to load the model in the llama
      engine.
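
      A hedged sketch of the routing check; the parameter and return values
      are assumptions, not the actual Ollama API:

        // pickEngine routes models whose vision projector ships as a
        // separate file to the llama engine, which honors the projector.
        func pickEngine(hasSplitProjector bool) string {
                if hasSplitProjector {
                        return "llama"
                }
                return "ollama"
        }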
    • llm: Enable new memory estimates by default · eb10390d
      Jesse Gross authored
      New memory estimates (see #11090 for more information) are now
      enabled automatically for all models running on the Ollama engine,
      improving both stability and performance through more accurate sizing
      and allocation. Models running on the llama engine will continue to
      use the original style of memory estimation.
  5. 10 Sep, 2025 2 commits
  6. 09 Sep, 2025 1 commit
    • llm: Clamp batch size to context size · e119783e
      Jesse Gross authored
      The context must always be able to store the current batch, so
      if the user requests a small context then we should also shrink
      the batch to match. This also fixes the TestLongInputContext
      test on the new engine. (The old engine already has this behavior.)
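
      A minimal Go sketch of the clamp (parameter names assumed):

        // clampBatch keeps the batch no larger than the context, since the
        // context must always be able to hold a full batch.
        func clampBatch(numBatch, numCtx int) int {
                if numBatch > numCtx {
                        return numCtx
                }
                return numBatch
        }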
  7. 08 Sep, 2025 1 commit
  8. 02 Sep, 2025 2 commits
  9. 29 Aug, 2025 1 commit
  10. 26 Aug, 2025 1 commit
  11. 20 Aug, 2025 1 commit
    • llm: Don't always evict models in CPU-only mode · 073fa31d
      Jesse Gross authored
      With old memory estimates, it's currently impossible to load more
      than one model at a time when no GPUs are available. This is because
      the check for whether we need to evict a model looks to see if all
      layers of the new model can be loaded onto GPUs, which is never true
      if there are no GPUs. Before the memory management changes, there
      was a special code path for CPU-only systems.
      
      This problem does not exist with new memory estimates.
      
      Fixes #11974
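
      A Go sketch of the corrected check, with hypothetical names; GPU fit
      only drives eviction when there are GPUs at all:

        // needsEviction reports whether a resident model must be evicted
        // before loading another one.
        func needsEviction(gpuFreeBytes []uint64, requiredBytes uint64) bool {
                if len(gpuFreeBytes) == 0 {
                        return false // CPU-only: there is no GPU memory to reclaim
                }
                var free uint64
                for _, b := range gpuFreeBytes {
                        free += b
                }
                return requiredBytes > free
        }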
  12. 18 Aug, 2025 1 commit
    • llm: Check for nil memory data before printing · e3ade453
      Jesse Gross authored
      We dump out our best memory estimate after we complete processing
      for any reason, including errors. This is helpful for finding
      what stopped us in error conditions, but in some cases we might not
      have gotten even the first result yet.
      
      Fixes #11957
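
      A minimal Go sketch of the guard (variable names assumed; uses
      log/slog):

        // Early failures can leave the memory data nil, so only dump the
        // estimate when we got far enough to produce one.
        if mem != nil {
                slog.Debug("best memory estimate", "memory", mem)
        }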
  13. 14 Aug, 2025 1 commit
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
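
      A Go sketch of the opt-in gate; the environment variable comes from
      the commit, the surrounding function is an assumption:

        import "os"

        // useNewEstimates enables allocation tracking only for models on
        // the Ollama engine when the user has opted in.
        func useNewEstimates(ollamaEngine bool) bool {
                return ollamaEngine && os.Getenv("OLLAMA_NEW_ESTIMATES") == "1"
        }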
  14. 23 Jun, 2025 2 commits
    • avoid context overflow (#11175) · 10a8e04a
      Daniel Hiltgen authored
      For models with smaller context windows, make sure we do not exceed the context size used in training.
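
      In rough Go terms, with assumed variable names:

        // Never allocate a context larger than the model was trained for.
        numCtx = min(numCtx, trainCtx)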
    • Re-remove cuda v11 (#10694) · 1c6669e6
      Daniel Hiltgen authored
      * Re-remove cuda v11
      
      Revert the revert - drop v11 support, requiring drivers newer than Feb 2023
      
      This reverts commit c6bcdc42.
      
      * Simplify layout
      
      With only one version of the GPU libraries, we can simplify things somewhat.  (Jetsons still require special handling.)
      
      * distinct sbsa variant for linux arm64
      
      This avoids accidentally trying to load the sbsa cuda libraries on
      a jetson system which results in crashes.
      
      * Temporarily prevent rocm+cuda mixed loading
  15. 29 May, 2025 1 commit
    • llm: Make "POST predict" error message more informative · f15ffc43
      Jesse Gross authored
      "POST predict" basically means that the runner has crashed, which
      can have many causes. However, many people think this is a specific
      error and either report only this message or group together unrelated
      bugs. This replaces it with a more friendly and helpful message.
  16. 19 May, 2025 4 commits
    • llm: Use first layer as memory buffer in estimation · 3fe74fba
      Jesse Gross authored
      This is a partial revert of 0478d440 "Fixed over vram allcation dure to
      small initial layer sizes."
      
      Previously we used the size of the first layer as an extra reserved
      amount of space to buffer our memory estimates. The above commit
      changed this to use the largest layer. However, this had performance
      impacts on more models than the original commit was trying to fix.
      
      This is just a heuristic without an ideal solution, so this goes back
      to the historic behavior.
      
      Fixes: #10765, #10756, #10752, #10726
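
      A one-line Go sketch of the restored heuristic (names assumed):

        // Reserve the first layer's size as estimation slack; using the
        // largest layer (the reverted change) over-reserved on many models.
        reserved := layers[0].Size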
    • ggml: Separate tensor load from backend creation · 94ab428e
      Jesse Gross authored
      Currently, when the backend is created, the tensors are loaded at the
      same time, which is a slow operation. This separates them into two
      steps:
       - Create backend, including enumerating tensors and memory allocation
       - Loading tensor data
      
      This allows more flexibility in managing model loading.
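
      A hedged Go sketch of the two-step flow; NewBackend and LoadTensors
      are assumed names, not the actual API:

        // Step 1: create the backend (enumerate tensors, allocate memory).
        b, err := NewBackend(modelPath)
        if err != nil {
                return err
        }
        // Step 2: load tensor data separately, when the caller is ready.
        if err := b.LoadTensors(ctx); err != nil {
                return err
        }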
    • llm: Estimate projector memory correctly for Ollama engine · d7555774
      Jesse Gross authored
      The llama engine always places the vision projector on the first GPU
      if one exists. However, the Ollama engine groups it with the output
      layer, which means the projector is only offloaded if all other layers
      are offloaded. The memory estimation code always assumes the former
      layout - this changes it to use the correct layout based on the engine.
      
      This addresses two impacts of the current behavior:
       - In multi-GPU setups, we can crash with OOM errors when we try to
         allocate memory on a full GPU while another still has space.
       - If the vision projector is large, it may prevent us from offloading
         anything when we could have fit some of the text layers.
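
      A Go sketch of the engine-dependent layout (all names assumed):

        // projectorPlacement mirrors where each engine puts the vision
        // projector so the estimate matches reality.
        func projectorPlacement(engine string, allLayersOffloaded bool) string {
                if engine == "llama" {
                        return "first-gpu" // always the first GPU, if one exists
                }
                // Ollama engine: grouped with the output layer, so offloaded
                // only when every other layer is.
                if allLayersOffloaded {
                        return "last-gpu"
                }
                return "cpu"
        }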
    • llm: Consistently track unassigned model data · a2cc8571
      Jesse Gross authored
      In some cases, if we fail to assign a piece of the model to a GPU then
      we lose track of this data. Although it doesn't change the memory
      allocation, it does affect the total size of the model reported by
      tools such as ollama ps (and also the percent offloaded).
      
      This can make it look like setting num_gpu isn't reflected in ollama ps.
      It is, but the offload percentage may appear not to change.
      
      Spreading the model across more GPUs will continue to impact the
      reported total size of the model.
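
      In rough Go terms (names assumed), the accounting now always includes
      unassigned data:

        // Unassigned data still counts toward the totals behind "ollama ps".
        total := gpuBytes + cpuBytes + unassignedBytes
        offloadPct := 100 * float64(gpuBytes) / float64(total)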
  17. 14 May, 2025 1 commit
  18. 13 May, 2025 2 commits
  19. 12 May, 2025 1 commit
  20. 08 May, 2025 1 commit
  21. 07 May, 2025 2 commits
    • sched: fix race leading to orphaned runners (#10599) · 5e380c3b
      Daniel Hiltgen authored
      If a model is loading, and the request context is canceled during the load
      by a client closing the connection, and another request is inbound for the
      same model with a different configuration (context size, etc.) thus requiring
      a reload, two unload events can be in flight.  The first shuts down the
      original model load, but the second one causes the loss of the new
      reloading runner's reference, thus orphaning the runner.
      
      The primary fix is detecting the duplicate unload and ignoring the second
      instance.  The load routine is also hardened to ensure we detect
      clobbering an already present runner and unload it with a warning.
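
      A Go sketch of the dedup guard, with hypothetical fields (uses
      log/slog):

        func (s *Scheduler) unload(r *runnerRef) {
                s.mu.Lock()
                defer s.mu.Unlock()
                if r.unloading {
                        // A second unload for this runner is already in flight;
                        // ignoring it preserves the reloading runner reference.
                        slog.Warn("ignoring duplicate unload", "pid", r.pid)
                        return
                }
                r.unloading = true
                // ... proceed to shut the runner down ...
        }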
    • remove cuda v11 (#10569) · fa393554
      Daniel Hiltgen authored
      This reduces the size of our Windows installer payloads by ~256M by dropping
      support for nvidia drivers older than Feb 2023.  Hardware support is unchanged.
      
      Linux default bundle sizes are reduced by ~600M to 1G.
  22. 06 May, 2025 1 commit
    • Move quantization to new backend (#10363) · 42481045
      Daniel Hiltgen authored
      * Move quantization logic to GGML via new backend
      
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
      
      * Remove "add model quantizations"
      
      This is no longer needed now that quantization is implemented in Go+GGML code directly.
  23. 05 May, 2025 3 commits
  24. 03 May, 2025 2 commits
    • win: ensure ollama paths come first (#10549) · 6a74bba7
      Daniel Hiltgen authored
      For all search path env vars, make sure our dirs come first
      to avoid potentially finding other incompatible libraries
      on the user's system.
      
      Also fixes a minor build script glitch for Windows ROCm.
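
      A minimal Go sketch of the ordering fix (the directory variable is an
      assumption; uses os):

        // Prepend our directory so it is searched before any incompatible
        // libraries elsewhere on the user's system.
        os.Setenv("PATH", ollamaDir+string(os.PathListSeparator)+os.Getenv("PATH"))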
    • sched: logging improvements (#10550) · 76ea735a
      Daniel Hiltgen authored
      This enhances our logging in the scheduler.  The initial "waiting for server" log
      no longer claims an error state; it now reports "not responding", which better
      reflects the actual state.  Runners now have slog wiring to report more details
      about the runner, including the PID.
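
      A Go sketch of the runner-scoped logging (field names assumed; uses
      log/slog):

        // Attach runner identity, including the PID, to every log line.
        logger := slog.With("runner", r.id, "pid", r.cmd.Process.Pid)
        logger.Info("waiting for server to become responsive")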
  25. 30 Apr, 2025 1 commit
  26. 27 Apr, 2025 1 commit
    • ggml: fix crash for array head counts · 6ed88985
      Devon Rifkin authored
      If it's an array, it uses the max value in the array.

      If array values for head counts become more popular, we can consider a
      more invasive change like #10225 to calculate more accurate estimates.
      
      Fixes: #9984
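
      A Go sketch of the fix, with metadata access simplified to a value:

        // headCount tolerates scalar and per-block array values, taking
        // the max when it is an array.
        func headCount(v any) uint32 {
                switch x := v.(type) {
                case uint32:
                        return x
                case []uint32:
                        var m uint32
                        for _, h := range x {
                                if h > m {
                                        m = h
                                }
                        }
                        return m
                }
                return 0
        }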
  27. 25 Apr, 2025 1 commit