1. 25 Apr, 2025 1 commit
  2. 26 Mar, 2025 2 commits
    • ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
      Jesse Gross authored
      Gemma3 uses sliding-window attention for its context on 5 of every 6
      layers, significantly reducing memory usage but making usage uneven
      across layers, which makes assigning layers to the correct GPU
      difficult. We currently estimate very conservatively by assuming all
      layers are consistent at the max size.
      
      Llama3.2-vision is also inconsistent between self-attention and
      cross-attention layers: at the moment, we calculate the correct total
      size and then average it across layers. In some cases this can crash
      if a large layer is placed on a GPU that was sized for the average.
      
      This change allows memory estimation to calculate the KV cache size
      per layer and take it into account when placing layers onto GPUs. We
      already do this for weights that vary per tensor, so this is a
      logical extension.
      
      Fixes #9730
      Fixes #9890
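The difference between the old uniform estimate and per-layer sizing can be sketched as follows. All numbers and the `layerKV` type are hypothetical, chosen only to mimic a Gemma3-like pattern where 5 of every 6 layers use a small sliding window; this is not the actual estimation code.

```go
package main

import "fmt"

// layerKV describes the KV cache requirement of one transformer layer.
// Sliding-window layers cap their effective context, so they need far
// less cache than full-attention layers. (Illustrative type, not upstream.)
type layerKV struct {
	headDim int
	kvHeads int
	ctxLen  int // effective context: window size or full context
}

// bytesPerLayer returns the f16 KV cache size for one layer:
// 2 (K and V) * 2 bytes * headDim * kvHeads * ctxLen.
func bytesPerLayer(l layerKV) uint64 {
	return 2 * 2 * uint64(l.headDim) * uint64(l.kvHeads) * uint64(l.ctxLen)
}

func main() {
	const fullCtx, window = 8192, 1024
	var layers []layerKV
	for i := 0; i < 12; i++ {
		ctx := window
		if i%6 == 5 { // every sixth layer attends over the full context
			ctx = fullCtx
		}
		layers = append(layers, layerKV{headDim: 128, kvHeads: 8, ctxLen: ctx})
	}

	var perLayerSum uint64
	for _, l := range layers {
		perLayerSum += bytesPerLayer(l)
	}
	// The old conservative estimate: every layer at the max size.
	worstCase := uint64(len(layers)) * bytesPerLayer(layerKV{128, 8, fullCtx})

	fmt.Printf("per-layer sum: %d MiB, uniform worst case: %d MiB\n",
		perLayerSum>>20, worstCase>>20)
	// → per-layer sum: 104 MiB, uniform worst case: 384 MiB
}
```

The gap between the two totals is exactly the headroom the uniform assumption wastes, and it is also why averaging instead can under-size a GPU that receives one of the large layers.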
    • llm: Fix debug logging for memory estimates · f4f0992b
      Jesse Gross authored
  3. 13 Mar, 2025 1 commit
  4. 04 Mar, 2025 1 commit
    • New engine: vision models and auto-fallback (#9113) · 1fdb351c
      Daniel Hiltgen authored
      * Include unified vision layers in memory prediction
      
      For newer vision models with a single gguf, include
      the projection estimates.
      
      * Adjust CLI to handle both styles of vision model metadata
      
      * Wire up new tokenizers for new engine
      
      If we're loading the new engine, utilize the new model
      text processor instead of calling into cgo wrappers for
      llama.cpp. This also cleans up some tech debt from the older
      tokenization flow for the C++ server, which was no longer used.
      
      This also adjusts the grammar handling logic to pass
      through to the new engine instead of utilizing the cgo
      schema to grammar call.
      
      * Lay foundation for auto selection of new engine
  5. 14 Feb, 2025 1 commit
    • next ollama runner (#7913) · 58245413
      Michael Yang authored
      feat: add new Ollama engine using ggml through cgo
      
      This change introduces a new way to run pretrained models. It introduces three high-level interfaces and a number of smaller helper interfaces to facilitate this.
      
      - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
      - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
      - `ml.Tensor` defines the interface for a tensor and tensor operations
      
      This is the first implementation of the new engine. Follow up PRs will implement more features:
      
      - non-greedy sampling (#8410)
      - integration with Ollama and KV caching (#8301)
      - more model support (#9080) with more coming soon
      Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
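The three interfaces described above can be sketched in miniature. The method sets below are illustrative stand-ins, much smaller than the real definitions in `model/model.go` and `ml/backend.go`; the `dense` and `echoModel` types are invented purely for demonstration.

```go
package main

import "fmt"

// Tensor is a toy version of ml.Tensor: a tensor plus tensor operations.
type Tensor interface {
	Shape() []int
}

// Backend is a toy version of ml.Backend: it loads a pretrained model
// into hardware and hands loaded tensors to Models.
type Backend interface {
	Get(name string) Tensor // fetch a loaded pretrained tensor by name
}

// Model is a toy version of model.Model: architectures implement their
// forward propagation in Forward, which is called to generate completions.
type Model interface {
	Forward(input Tensor) (Tensor, error)
}

// dense is a stand-in tensor that only knows its shape.
type dense struct{ dims []int }

func (t dense) Shape() []int { return t.dims }

// echoModel is a trivial Model whose Forward returns a tensor of the
// same shape as its input.
type echoModel struct{}

func (echoModel) Forward(in Tensor) (Tensor, error) {
	return dense{dims: in.Shape()}, nil
}

func main() {
	var m Model = echoModel{}
	out, err := m.Forward(dense{dims: []int{1, 4}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Shape()) // → [1 4]
}
```

The split mirrors the description: architectures depend only on the `Tensor` and `Backend` interfaces, so the ggml-backed implementation stays swappable behind them.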
  6. 10 Dec, 2024 1 commit
  7. 04 Dec, 2024 1 commit
  8. 03 Dec, 2024 1 commit
  9. 01 Nov, 2024 1 commit
  10. 18 Oct, 2024 1 commit
  11. 17 Oct, 2024 1 commit
  12. 06 Sep, 2024 1 commit
  13. 05 Sep, 2024 1 commit
    • Introduce GPU Overhead env var (#5922) · b05c9e83
      Daniel Hiltgen authored
      Provide a mechanism for users to set aside an amount of VRAM on each
      GPU, either to make room for other applications they want to start
      after Ollama or to work around memory prediction bugs.
  14. 20 Jun, 2024 1 commit
  15. 18 Jun, 2024 2 commits
    • Handle models with divergent layer sizes · 359b15a5
      Daniel Hiltgen authored
      The recent refactoring of the memory prediction assumed all layers
      are the same size, but for some models (like deepseek-coder-v2) this
      is not the case, so our predictions were significantly off.
    • Tighten up memory prediction logging · 7784ca33
      Daniel Hiltgen authored
      Prior to this change, we logged the memory prediction multiple times
      as the scheduler iterated to find a suitable configuration, which
      could be confusing since only the last log before the server starts
      is actually valid. We now log once, on the final configuration, just
      before starting the server. The log also reports which library is in
      use instead of always saying "offloading to gpu" even when running
      on the CPU.
  16. 14 Jun, 2024 3 commits
  17. 04 Jun, 2024 2 commits
  18. 24 May, 2024 1 commit
  19. 13 May, 2024 2 commits
  20. 10 May, 2024 1 commit
  21. 08 May, 2024 1 commit
  22. 07 May, 2024 1 commit
  23. 05 May, 2024 1 commit
    • Centralize server config handling · f56aa200
      Daniel Hiltgen authored
      This moves all the env var reading into one central module and logs
      the loaded config once at startup, which should help when
      troubleshooting from user server logs.
  24. 01 May, 2024 1 commit
  25. 26 Apr, 2024 1 commit
  26. 25 Apr, 2024 1 commit
  27. 24 Apr, 2024 1 commit
    • Add back memory escape valve · 5445aaa9
      Daniel Hiltgen authored
      If we get our predictions wrong, this can be used to set a lower
      memory limit as a workaround. The recent multi-GPU refactoring
      accidentally removed it, so this adds it back.
  28. 23 Apr, 2024 1 commit
    • Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well as
      loading multiple models by spawning multiple runners. The defaults
      are currently 1 concurrent request per model and 1 loaded model at a
      time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and
      OLLAMA_MAX_LOADED_MODELS.
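The per-model request cap can be sketched with a buffered channel used as a semaphore: its capacity plays the role of OLLAMA_NUM_PARALLEL, and an analogous cap on the number of runner instances would model OLLAMA_MAX_LOADED_MODELS. This is a minimal sketch of the idea, not the actual scheduler.

```go
package main

import (
	"fmt"
	"sync"
)

// runner serves requests for one loaded model. slots holds one token per
// allowed in-flight request, so sends block once the cap is reached.
type runner struct {
	model string
	slots chan struct{}
}

func newRunner(model string, parallel int) *runner {
	return &runner{model: model, slots: make(chan struct{}, parallel)}
}

func (r *runner) handle(req int) {
	r.slots <- struct{}{}        // block while the model is at capacity
	defer func() { <-r.slots }() // free the slot when the request finishes
	fmt.Printf("%s served request %d\n", r.model, req)
}

func main() {
	r := newRunner("llama", 2) // e.g. OLLAMA_NUM_PARALLEL=2
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.handle(i) // at most 2 of these run inside handle at once
		}(i)
	}
	wg.Wait()
}
```

Spawning one runner per loaded model, each with its own semaphore, keeps the two limits independent: one bounds parallelism within a model, the other bounds how many models hold memory at once.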