1. 20 Aug, 2025 1 commit
    • llm: Don't always evict models in CPU-only mode · 073fa31d
      Jesse Gross authored
      With old memory estimates, it's currently impossible to load more
      than one model at a time when no GPUs are available. This is because
      the check for whether we need to evict a model looks to see if all
      layers of the new model can be loaded onto GPUs, which is never true
      if there are no GPUs. Before the memory management changes, there
      was a special code path for CPU-only systems.
      
      This problem does not exist with new memory estimates.
      
      Fixes #11974
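
      As a rough illustration only (the type and function names below are
      invented, not the actual scheduler code), the fix amounts to something
      like skipping the GPU-fit check entirely when no GPUs are present:

        package main

        import "fmt"

        // gpuInfo is a hypothetical stand-in for the scheduler's view of one GPU.
        type gpuInfo struct{ freeBytes uint64 }

        // needsEvict sketches the eviction check: evict only if the new model's
        // layers cannot all fit on the available GPUs. Without the early return,
        // a CPU-only host (no GPUs) would always appear to need an eviction.
        func needsEvict(gpus []gpuInfo, layerBytes []uint64) bool {
            if len(gpus) == 0 {
                // CPU-only: don't force an eviction just because nothing
                // can be offloaded to a GPU.
                return false
            }
            var need, free uint64
            for _, l := range layerBytes {
                need += l
            }
            for _, g := range gpus {
                free += g.freeBytes
            }
            return need > free
        }

        func main() {
            fmt.Println(needsEvict(nil, []uint64{1 << 30})) // false on CPU-only hosts
        }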
  2. 14 Aug, 2025 1 commit
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
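
      For illustration only, reading an opt-in flag like this typically looks
      something like the sketch below; the helper name is invented, and only
      the OLLAMA_NEW_ESTIMATES=1 setting itself comes from this commit:

        package main

        import (
            "fmt"
            "os"
        )

        // newEstimatesEnabled is a hypothetical helper: unset or any value other
        // than "1" keeps the existing estimate-based behavior.
        func newEstimatesEnabled() bool {
            return os.Getenv("OLLAMA_NEW_ESTIMATES") == "1"
        }

        func main() {
            if newEstimatesEnabled() {
                fmt.Println("tracking actual allocations reported by the engine")
            } else {
                fmt.Println("using the existing upfront memory estimates")
            }
        }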
  3. 19 May, 2025 4 commits
    • llm: Use first layer as memory buffer in estimation · 3fe74fba
      Jesse Gross authored
      This is a partial revert of 0478d440 "Fixed over vram allcation dure to
      small initial layer sizes."
      
      Previously we used the size of the first layer as an extra reserved
      amount of space to buffer our memory estimates. The above commit
      changed this to use the largest layer. However, this had performance
      impacts on more models than the original commit was trying to fix.
      
      This is just a heuristic without an ideal solution, so this goes back
      to the historic behavior.
      
      Fixes: #10765, #10756, #10752, #10726
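
      A toy sketch of the two heuristics being compared - reserving the first
      layer's size versus the largest layer's size - with invented names:

        package main

        import "fmt"

        // reserveBuffer returns the extra space reserved on top of the memory
        // estimate. The historic behavior (restored here) uses the first layer;
        // the reverted change used the largest layer, which was safer but cost
        // offload room on many models.
        func reserveBuffer(layerBytes []uint64, useLargest bool) uint64 {
            if len(layerBytes) == 0 {
                return 0
            }
            if !useLargest {
                return layerBytes[0] // historic behavior
            }
            largest := layerBytes[0]
            for _, l := range layerBytes[1:] {
                if l > largest {
                    largest = l
                }
            }
            return largest
        }

        func main() {
            layers := []uint64{64 << 20, 512 << 20, 512 << 20} // toy sizes in bytes
            fmt.Println(reserveBuffer(layers, false), reserveBuffer(layers, true))
        }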
    • ggml: Separate tensor load from backend creation · 94ab428e
      Jesse Gross authored
      Currently, when the backend is created, the tensors are loaded at the
      same time, which is a slow operation. This separates them into two
      steps:
       - Create backend, including enumerating tensors and memory allocation
       - Loading tensor data
      
      This allows more flexibility in managing model loading.
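
      A rough sketch of the two-step shape described above; the type and method
      names are illustrative only and do not match the actual ggml backend API:

        package main

        import "fmt"

        // backend is a hypothetical stand-in: construction only enumerates
        // tensors and allocates memory, while the slow read of tensor data is a
        // separate, explicit step.
        type backend struct{ allocated bool }

        func newBackend() *backend {
            // Step 1: enumerate tensors and allocate memory (fast).
            return &backend{allocated: true}
        }

        func (b *backend) loadTensorData() error {
            // Step 2: read tensor data into the allocations (slow).
            if !b.allocated {
                return fmt.Errorf("backend not allocated")
            }
            return nil
        }

        func main() {
            b := newBackend()      // placement decisions can happen here
            _ = b.loadTensorData() // data is only read once loading is committed
            fmt.Println("loaded")
        }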
    • llm: Estimate projector memory correctly for Ollama engine · d7555774
      Jesse Gross authored
      The Llama engine always places the vision projector on the first GPU
      if one exists. However, the Ollama engine groups it with the output
      layer, which means the projector is only offloaded if all other layers
      are offloaded. The memory estimation code always assumes the former
      layout - this changes it to use the correct layout based on the engine.
      
      This addresses two impacts of the current behavior:
       - In multi-GPU setups, we can crash with OOM errors when we try to
         allocate memory on a full GPU while another still has space.
       - If the vision projector is large, it may prevent us from offloading
         anything when we could have fit some of the text layers.
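
      A hedged sketch of the layout decision with invented names, showing the
      projector grouped with the first GPU for the Llama engine and with the
      output layer for the Ollama engine:

        package main

        import "fmt"

        // projectorGroup says where the vision projector is counted during
        // estimation (hypothetical names).
        type projectorGroup int

        const (
            withFirstGPU    projectorGroup = iota // llama engine layout
            withOutputLayer                       // ollama engine layout
        )

        // projectorPlacement picks the layout the estimator should assume based
        // on which engine will actually run the model.
        func projectorPlacement(ollamaEngine bool) projectorGroup {
            if ollamaEngine {
                return withOutputLayer
            }
            return withFirstGPU
        }

        func main() {
            fmt.Println(projectorPlacement(true) == withOutputLayer) // true
        }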
    • llm: Consistently track unassigned model data · a2cc8571
      Jesse Gross authored
      In some cases, if we fail to assign a piece of the model to a GPU then
      we lose track of this data. Although it doesn't change the memory
      allocation, it does affect the total size of the model reported by
      tools such as ollama ps (and also the percent offloaded).
      
      This makes it look like setting num_gpu isn't reflected in ollama ps.
      It is, but the offload percentage may appear not to change.
      
      Spreading the model across more GPUs will continue to impact the
      reported total size of the model.
  4. 14 May, 2025 1 commit
  5. 13 May, 2025 1 commit
  6. 27 Apr, 2025 1 commit
    • ggml: fix crash for array head counts · 6ed88985
      Devon Rifkin authored
      If the head count is an array, the maximum value in the array is used.

      If array values for head counts become more popular, we can consider a
      more invasive change like #10225 to calculate more accurate estimates.
      
      Fixes: #9984
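
      A small sketch of the fallback described above - take the maximum of an
      array-valued head count; the metadata handling is simplified and the
      helper is hypothetical:

        package main

        import "fmt"

        // headCount returns a single head count for estimation. Some models
        // store a scalar, others an array (one value per layer); as a
        // conservative fallback the maximum value is used for arrays.
        func headCount(v any) uint32 {
            switch t := v.(type) {
            case uint32:
                return t
            case []uint32:
                var largest uint32
                for _, n := range t {
                    if n > largest {
                        largest = n
                    }
                }
                return largest
            default:
                return 0
            }
        }

        func main() {
            fmt.Println(headCount(uint32(32)), headCount([]uint32{8, 16, 32})) // 32 32
        }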
  7. 25 Apr, 2025 1 commit
  8. 26 Mar, 2025 2 commits
    • ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
      Jesse Gross authored
      Gemma3 uses sliding-window attention on 5 out of every 6 layers,
      significantly reducing memory usage but leading to uneven usage across
      layers, which makes allocation to the correct GPU difficult. We currently
      estimate very conservatively by assuming all layers are consistent
      at the max size.
      
      Llama3.2-vision is also inconsistent between self-attention and
      cross-attention layers - at the moment, we calculate the correct total
      size and then average it across layers. In some cases, this may lead
      to crashes if a large layer is placed on a GPU sized by the average.

      This change allows memory estimation to calculate per-layer KV cache
      sizes and take them into account when placing layers onto GPUs. We
      already do this for weights that vary per-tensor, so this is a logical
      extension.
      
      Fixes #9730
      Fixes #9890
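
      A toy sketch of per-layer placement: with per-layer KV cache sizes, small
      sliding-window layers can be packed onto a GPU instead of budgeting every
      layer at the maximum size. Names and numbers are illustrative only:

        package main

        import "fmt"

        // placeLayers greedily assigns leading layers to a GPU budget using
        // per-layer KV cache sizes rather than a single uniform size, and
        // returns how many layers fit.
        func placeLayers(perLayerKV []uint64, gpuFree uint64) int {
            placed := 0
            for _, kv := range perLayerKV {
                if kv > gpuFree {
                    break
                }
                gpuFree -= kv
                placed++
            }
            return placed
        }

        func main() {
            // Five small sliding-window layers and one full-context layer (toy
            // MiB values): per-layer accounting fits 5 layers in a 512 MiB
            // budget, while a uniform 512 MiB estimate would fit only 1.
            layers := []uint64{64, 64, 64, 64, 64, 512}
            fmt.Println(placeLayers(layers, 512))
        }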
    • llm: Fix debug logging for memory estimates · f4f0992b
      Jesse Gross authored
  9. 13 Mar, 2025 1 commit
  10. 04 Mar, 2025 1 commit
    • New engine: vision models and auto-fallback (#9113) · 1fdb351c
      Daniel Hiltgen authored
      * Include unified vision layers in memory prediction
      
      For newer vision models with a single gguf, include
      the projection estimates.
      
      * Adjust CLI to handle both styles of vision model metadata
      
      * Wire up new tokenizers for new engine
      
      If we're loading the new engine, utilize the new model
      text processor instead of calling into cgo wrappers for
      llama.cpp. This also cleans up some tech debt from the
      older tokenization flow for the C++ server, which was
      no longer used.
      
      This also adjusts the grammar handling logic to pass
      through to the new engine instead of utilizing the cgo
      schema-to-grammar call.
      
      * Lay foundation for auto selection of new engine
  11. 14 Feb, 2025 1 commit
    • next ollama runner (#7913) · 58245413
      Michael Yang authored
      
      
      feat: add new Ollama engine using ggml through cgo
      
      This change introduces a new way to run pretrained models. It introduces three high-level interfaces and a number of smaller helper interfaces to facilitate this (a rough sketch of the core interfaces follows at the end of this entry).
      
      - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
      - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
      - `ml.Tensor` defines the interface for a tensor and tensor operations
      
      This is the first implementation of the new engine. Follow up PRs will implement more features:
      
      - non-greedy sampling (#8410)
      - integration with Ollama and KV caching (#8301)
      - more model support (#9080) with more coming soon
      Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
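
      A rough, non-authoritative sketch of the three interfaces described
      above, trimmed to their essence; the actual method sets and signatures
      in model/model.go and ml/backend.go differ:

        package sketch

        // Tensor is a minimal stand-in for ml.Tensor: a handle to data plus the
        // operations models compose in their forward pass.
        type Tensor interface {
            Shape() []int
            Mulmat(other Tensor) Tensor
        }

        // Backend is a minimal stand-in for ml.Backend: it owns the weights
        // loaded onto hardware (GPU, CPU, etc.) and hands tensors back to models.
        type Backend interface {
            Get(name string) Tensor
        }

        // Model is a minimal stand-in for model.Model: an architecture implements
        // its forward propagation in terms of Backend and Tensor.
        type Model interface {
            Forward(b Backend, inputs []int32) (Tensor, error)
        }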
  12. 10 Dec, 2024 1 commit
  13. 04 Dec, 2024 1 commit
  14. 03 Dec, 2024 1 commit
  15. 01 Nov, 2024 1 commit
  16. 18 Oct, 2024 1 commit
  17. 17 Oct, 2024 1 commit
  18. 06 Sep, 2024 1 commit
  19. 05 Sep, 2024 1 commit
    • Introduce GPU Overhead env var (#5922) · b05c9e83
      Daniel Hiltgen authored
      Provide a mechanism for users to set aside an amount of VRAM on each GPU
      to make room for other applications they want to start after Ollama, or to
      work around memory prediction bugs.
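
      A hedged sketch of how a per-GPU reservation like this typically feeds
      into the free-VRAM figure the scheduler works with; the helpers are
      invented, only the OLLAMA_GPU_OVERHEAD variable itself comes from this
      commit, and it is assumed here to be given in bytes:

        package main

        import (
            "fmt"
            "os"
            "strconv"
        )

        // gpuOverheadBytes reads an optional per-GPU reservation from the
        // environment; unset or invalid values mean no reservation (hypothetical).
        func gpuOverheadBytes() uint64 {
            v, err := strconv.ParseUint(os.Getenv("OLLAMA_GPU_OVERHEAD"), 10, 64)
            if err != nil {
                return 0
            }
            return v
        }

        // usableVRAM subtracts the reservation from what a GPU reports as free,
        // leaving headroom for applications started after Ollama.
        func usableVRAM(freeBytes uint64) uint64 {
            overhead := gpuOverheadBytes()
            if overhead >= freeBytes {
                return 0
            }
            return freeBytes - overhead
        }

        func main() {
            fmt.Println(usableVRAM(8 << 30))
        }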
  20. 20 Jun, 2024 1 commit
  21. 18 Jun, 2024 2 commits
    • Handle models with divergent layer sizes · 359b15a5
      Daniel Hiltgen authored
      The recent refactoring of the memory prediction assumed all layers
      are the same size, but for some models (like deepseek-coder-v2) this
      is not the case, so our predictions were significantly off.
    • Tighten up memory prediction logging · 7784ca33
      Daniel Hiltgen authored
      Prior to this change, we logged the memory prediction multiple times
      as the scheduler iterated to find a suitable configuration, which could be
      confusing since only the last log before the server starts is actually valid.
      This now logs once, just before starting the server, with the final configuration.
      It also reports which library is being used instead of always saying
      "offloading to gpu" even when running on the CPU.
  22. 14 Jun, 2024 3 commits
  23. 04 Jun, 2024 2 commits
  24. 24 May, 2024 1 commit
  25. 13 May, 2024 2 commits
  26. 10 May, 2024 1 commit
  27. 08 May, 2024 1 commit
  28. 07 May, 2024 1 commit
  29. 05 May, 2024 1 commit
    • Centralize server config handling · f56aa200
      Daniel Hiltgen authored
      This moves all the env var reading into one central module
      and logs the loaded config once at startup, which should
      help when troubleshooting user server logs.
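
      A minimal sketch of the pattern described - read env vars once into a
      central config value and log it at startup; the struct and fields below
      are invented for illustration rather than the actual module:

        package main

        import (
            "log/slog"
            "os"
        )

        // config gathers env-var driven settings in one place so every reader
        // sees the same values and they can be logged once at startup.
        type config struct {
            Host      string
            KeepAlive string
        }

        func loadConfig() config {
            return config{
                Host:      os.Getenv("OLLAMA_HOST"),
                KeepAlive: os.Getenv("OLLAMA_KEEP_ALIVE"),
            }
        }

        func main() {
            cfg := loadConfig()
            // A single startup log line shows what the server is running with.
            slog.Info("server config", "OLLAMA_HOST", cfg.Host, "OLLAMA_KEEP_ALIVE", cfg.KeepAlive)
        }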
  30. 01 May, 2024 1 commit
  31. 26 Apr, 2024 1 commit