1. 22 Oct, 2025 1 commit
    • llamarunner: Record the time for all batches during prompt processing · a8d9c264
      Jesse Gross authored
      Currently, we record the time for only the last batch when processing
      the prompt. This results in unrealistically high reported prompt eval
      rates for the old llama runner.
      
      Before:
      total duration:       31.273112939s
      load duration:        4.97054657s
      prompt eval count:    32768 token(s)
      prompt eval duration: 235.137439ms
      prompt eval rate:     139356.80 tokens/s
      eval count:           1873 token(s)
      eval duration:        18.173182374s
      eval rate:            103.06 tokens/s
      
      After:
      total duration:       30.024798033s
      load duration:        4.758588663s
      prompt eval count:    32768 token(s)
      prompt eval duration: 7.779621548s
      prompt eval rate:     4212.03 tokens/s
      eval count:           1769 token(s)
      eval duration:        17.148014223s
      eval rate:            103.16 tokens/s
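      
      A minimal sketch of the fix in Go (hypothetical names, not the actual
      runner code): the decode timer has to accumulate across every prompt
      batch instead of being reset on each one.
      
        package sketch
        
        import "time"
        
        // processPrompt decodes the prompt in batches and returns the total
        // time spent decoding. decode and the batch layout are stand-ins.
        func processPrompt(batches [][]int32, decode func([]int32) error) (time.Duration, error) {
            var total time.Duration
            for _, b := range batches {
                start := time.Now()
                if err := decode(b); err != nil {
                    return total, err
                }
                total += time.Since(start) // accumulate; the old code overwrote this
            }
            return total, nil
        }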
  2. 20 Oct, 2025 1 commit
  3. 13 Oct, 2025 2 commits
  4. 11 Oct, 2025 1 commit
  5. 09 Oct, 2025 3 commits
    • llamarunner: update metrics · bbbc73d6
      Michael Yang authored
      this change updates how metrics are collected. until now, performance
      metrics, specifically initial input processing and subsequent generation
      durations, were collected by taking timestamps at sequence creation, at
      first token generation, and at generation completion. the processing
      duration was computed as first token generation minus sequence creation,
      while the generation duration was generation completion minus first
      token generation.
      
      while this approach is an accurate end-to-end measure of processing and
      generation, it's not comparable to other tools, which only measure the
      active, i.e. decode, duration.
      
      this change updates the metrics to capture only the decode duration so
      they can be compared more directly to other tools.
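      
      a rough sketch in Go of the difference (names are hypothetical):
      
        package sketch
        
        import "time"
        
        type seqMetrics struct {
            created        time.Time
            firstToken     time.Time
            done           time.Time
            decodeDuration time.Duration // active decode time only
        }
        
        // old approach: end-to-end durations from wall-clock timestamps
        func endToEnd(m seqMetrics) (processing, generation time.Duration) {
            return m.firstToken.Sub(m.created), m.done.Sub(m.firstToken)
        }
        
        // new approach: accumulate time around each decode call instead
        func timeDecode(m *seqMetrics, decode func() error) error {
            start := time.Now()
            err := decode()
            m.decodeDuration += time.Since(start)
            return err
        }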
    • Revert "add truncate and shift parameters (#12519)" (#12545) · 7d965258
      Jeffrey Morgan authored
      This reverts commit 6a62b894.
    • add truncate and shift parameters (#12519) · 6a62b894
      Jeffrey Morgan authored
  6. 01 Oct, 2025 1 commit
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner. This should eliminate inconsistency between our GPU discovery and
      the runners' capabilities at runtime, particularly for cases where we try
      to filter out unsupported GPUs. Now the runner does that implicitly based
      on the actual device list. In some cases free VRAM reporting can be
      unreliable, which can lead to scheduling mistakes, so this also includes
      a patch to leverage more reliable VRAM reporting libraries if available.
      
      Automatic workarounds have been removed as only one GPU leveraged this, which
      is now documented. This GPU will soon fall off the support matrix with the next
      ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
  7. 14 Aug, 2025 1 commit
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
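      
      A minimal sketch of the opt-in gate in Go (the real wiring in the
      server is more involved):
      
        package sketch
        
        import "os"
        
        // newEstimatesEnabled reports whether models on the Ollama engine
        // should use allocation tracking instead of the upfront estimates.
        func newEstimatesEnabled() bool {
            return os.Getenv("OLLAMA_NEW_ESTIMATES") == "1"
        }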
  8. 14 May, 2025 1 commit
  9. 12 May, 2025 1 commit
  10. 08 May, 2025 1 commit
  11. 05 May, 2025 1 commit
  12. 03 Apr, 2025 1 commit
    • llm: set done reason at server level (#9830) · e53b3cbd
      Bruce MacDonald authored
      No functional change. Many different done reasons can be set at the
      runner level, so rather than obscuring them we should return them to the
      server process and let it choose what to do with them. This separates
      the API concerns from the runner.
  13. 31 Mar, 2025 2 commits
    • runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear the KV cache when the shift operation is not supported by the
      model. Added a KvCacheCanShift() check to handle models that can't
      perform cache shifts, falling back to a full cache clear while
      preserving the logical token history to maintain expected behavior
      when the context window fills up.
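      
      A sketch of the fallback in Go (a hypothetical slice of the real
      cache interface, not the actual code):
      
        package sketch
        
        // kvCache is a hypothetical slice of the real cache interface.
        type kvCache interface {
            CanShift() bool
            Shift(discard int)
            Clear()
        }
        
        // makeRoom frees space when the context window fills up: shift if
        // the model supports it, otherwise clear the cache entirely while
        // keeping the logical token history so it can be re-decoded.
        func makeRoom(c kvCache, history []int32, discard int) []int32 {
            keep := history[discard:] // logical history after dropping the oldest tokens
            if c.CanShift() {
                c.Shift(discard) // cached data for keep survives
            } else {
                c.Clear() // no shift support: keep must be re-decoded
            }
            return keep
        }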
    • runner: Release semaphore and improve error messages on failures · b2a46529
      Jesse Gross authored
      If we have an error after creating a new sequence but before
      finding a slot for it, we return without releasing the semaphore.
      This reduces our parallel sequences and eventually leads to deadlock.
      
      In practice this should never happen because once we have acquired
      the semaphore, we should always be able to find a slot. However, the
      code is clearly not correct.
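      
      A sketch of the corrected error path in Go (hypothetical function
      shape; the runner does use golang.org/x/sync/semaphore):
      
        package sketch
        
        import (
            "context"
        
            "golang.org/x/sync/semaphore"
        )
        
        // addSequence must release the semaphore if anything fails between
        // acquiring it and placing the sequence in a slot; otherwise
        // parallel capacity leaks away until the runner deadlocks.
        func addSequence(ctx context.Context, sem *semaphore.Weighted, findSlot func() error) error {
            if err := sem.Acquire(ctx, 1); err != nil {
                return err
            }
            if err := findSlot(); err != nil {
                sem.Release(1) // previously missing on this error path
                return err
            }
            return nil
        }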
  14. 14 Mar, 2025 1 commit
    • llm: remove internal subprocess req and resp types (#9324) · 3892c3a7
      Bruce MacDonald authored
      This commit refactors the LLM subsystem by removing internal subprocess
      request and response types. It consolidates duplicate type definitions
      across the codebase, moving them to centralized locations. The change also
      standardizes interfaces between components, simplifies the ServerStatusResp
      struct, and moves the ParseDurationMs function to a common package. This
      cleanup reduces code duplication between different runner implementations
      (llamarunner and ollamarunner).
  15. 04 Mar, 2025 1 commit
    • ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend. this ensures
        the information is always present without requiring an explicit call
      - convert to structured logging
      - enumerate devices rather than backends, since devices are ordered
      - track device indices grouped by device name (sketched below)
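      
      a sketch in Go of the device enumeration (field names are
      illustrative, not the actual log schema):
      
        package sketch
        
        import "log/slog"
        
        // logDevices emits one structured log line per device, tracking an
        // index per device name.
        func logDevices(names []string) {
            index := map[string]int{}
            for _, name := range names {
                slog.Info("system", "device", name, "index", index[name])
                index[name]++
            }
        }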
  16. 28 Feb, 2025 1 commit
  17. 27 Feb, 2025 2 commits
  18. 14 Feb, 2025 2 commits
    • llamarunner: Init GGML before printing system info · 010313bb
      Jesse Gross authored
      We currently print system info before the GGML backends are loaded.
      This results in only getting information about the default lowest
      common denominator runner. If we move the GGML init earlier, we can
      see what we are actually running.
      
      Before:
      time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24
      
      After:
      time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
    • Runner for Ollama engine · ed443a03
      Jesse Gross authored
      This provides integration with the new Ollama engine
      (58245413 next ollama runner (#7913)) and the rest of the Ollama
      infrastructure such as the runner and Ollama server.
      
      In addition, it also builds out the KV cache infrastructure to
      support requirements of how Ollama runs models such as:
       - Parallel processing
       - Memory management for defragmentation and shifting
       - Multi-modal models
      
      Both old and new engines continue to be supported. By default, only
      the old engine is used. To enable the new engine:
      
      Start the server with the OLLAMA_NEW_ENGINE environment variable set:
      OLLAMA_NEW_ENGINE=1 ./ollama serve
      
      Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
      ./ollama run jessegross/llama3.1
  19. 08 Jan, 2025 1 commit
  20. 17 Dec, 2024 1 commit
    • llama: Ensure KV cache is fully defragmented. · 08a832b4
      Jesse Gross authored
      Sometimes the KV cache requires defragmentation even without
      triggering the threshold heuristic. In this case, decoding
      will not be able to find a KV cache slot. This is particularly
      difficult for the caller to handle if it happens in between
      ubatches. To avoid this, we should immediately trigger a defrag.
      
      In addition, a heavily fragmented cache can require more than
      max_moves to defragment. Currently, we stop when we hit the limit
      but this can leave a cache that still does not have adequate space
      even after defragmentation is triggered. Instead, we should do
      multiple batches of processing until everything is complete.
      
      Fixes #7949
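      
      A sketch in Go of the second fix (hypothetical names; the per-pass
      move budget is illustrative): run defrag passes until a pass moves
      nothing, instead of stopping at the max_moves limit.
      
        package sketch
        
        // defragment keeps running bounded defrag passes until everything
        // has been compacted, rather than stopping after one pass.
        func defragment(pass func(maxMoves int) (moved int)) {
            const maxMoves = 512
            for pass(maxMoves) > 0 {
            }
        }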
  21. 11 Dec, 2024 1 commit
  22. 10 Dec, 2024 1 commit
    • build: Make target improvements (#7499) · 4879a234
      Daniel Hiltgen authored
      * llama: wire up builtin runner
      
      This adds a new entrypoint into the ollama CLI to run the cgo-built
      runner. On Mac arm64, this will have GPU support, but on all other
      platforms it will be the lowest common denominator CPU build. After we
      fully transition to the new Go runners, more tech debt can be removed
      and we can stop building the "default" runner via make and rely on the
      builtin always.
      
      * build: Make target improvements
      
      Add a few new targets and help for building locally.
      This also adjusts the runner lookup to favor local builds, then
      runners relative to the executable, and finally payloads (see the
      sketch at the end of this list).
      
      * Support customized CPU flags for runners
      
      This implements a simplified custom CPU flags pattern for the runners.
      When built without overrides, the runner name contains the vector flag
      we check for (AVX) to ensure we don't try to run on unsupported systems
      and crash.  If the user builds a customized set, we omit the naming
      scheme and don't check for compatibility.  This avoids checking
      requirements at runtime, so that logic has been removed as well.  This
      can be used to build GPU runners with no vector flags, or CPU/GPU
      runners with additional flags (e.g. AVX512) enabled.
      
      * Use relative paths
      
      If the user checks out the repo in a path that contains spaces, make gets
      really confused so use relative paths for everything in-repo to avoid breakage.
      
      * Remove payloads from main binary
      
      * install: clean up prior libraries
      
      This removes support for v0.3.6 and older versions (before the tar bundle)
      and ensures we clean up prior libraries before extracting the bundle(s).
      Without this change, runners and dependent libraries could leak when we
      update and lead to subtle runtime errors.
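      
      A sketch in Go of the adjusted lookup order described above (paths
      and the payload directory are hypothetical):
      
        package sketch
        
        import (
            "fmt"
            "os"
            "path/filepath"
        )
        
        // findRunner resolves a runner binary: local builds first, then
        // runners next to the executable, and finally extracted payloads.
        func findRunner(name, payloadDir string) (string, error) {
            exe, err := os.Executable()
            if err != nil {
                return "", err
            }
            candidates := []string{
                filepath.Join("build", "runners", name),           // local build
                filepath.Join(filepath.Dir(exe), "runners", name), // relative to executable
                filepath.Join(payloadDir, name),                   // payloads
            }
            for _, c := range candidates {
                if _, err := os.Stat(c); err == nil {
                    return c, nil
                }
            }
            return "", fmt.Errorf("runner %q not found", name)
        }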
  23. 03 Dec, 2024 1 commit
  24. 27 Nov, 2024 1 commit
  25. 26 Nov, 2024 2 commits
    • runner.go: Don't try to extract image tags for text models · 71e6a0d0
      Jesse Gross authored
      When processing a prompt, we look for image tags of the form
      [img-0], which are inserted by the Ollama server process.
      However, this can cause errors if the original prompt has these
      tags - typically an image not found error is returned.
      
      This changes tag searching behavior to be similar to the 0.3.x
      series, which will largely avoid these problems. However, they can
      still happen when input text with these tags is used with image
      models. The correct solution is to escape the tags, but this is a
      larger issue with special sequences in general, so this is an
      incremental fix that should avoid the problem for the majority
      of cases.
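      
      A sketch of the behavior in Go (hypothetical function; the tag
      pattern matches the [img-N] markers described above):
      
        package sketch
        
        import "regexp"
        
        var imgTag = regexp.MustCompile(`\[img-(\d+)\]`)
        
        // extractImageTags returns the [img-N] markers in a prompt, but
        // only for models that accept images; text-only prompts are never
        // scanned, so stray [img-0] text can't trigger an image-not-found
        // error.
        func extractImageTags(prompt string, multimodal bool) []string {
            if !multimodal {
                return nil
            }
            return imgTag.FindAllString(prompt, -1)
        }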
    • runner.go: Add unit tests for context shifting · 2cd11ae3
      Jesse Gross authored
      This also makes it easier to truncate long inputs in the same way
      as shifting, though truncation is not actually implemented here.
      This type of truncation has a trade-off between quality and time
      to first token.
  26. 23 Nov, 2024 1 commit
    • runner.go: Fix deadlock with many concurrent requests · 3478b2cf
      Jesse Gross authored
      If there are no available slots for new sequences then a request
      will not be added to the processing queue but will continue on
      to wait for a response that never comes. Besides never giving a
      response to the request, this prevents the model from being
      unloaded due to the outstanding request.
      
      To prevent this, there are semaphores that prevent more requests
      from being processed than there are slots - one in the Ollama
      server and one in the runner.
       - The Ollama server one works, but it is not designed to protect
         the runner's internal data structures, and the runner can return
         a final response before clearing its data structures.
       - The internal runner semaphore has similar behavior: it can
         release the semaphore when it issues a response. This is
         wrong - it should only release the semaphore after it has
         cleared the data structures.
      
      In addition, we should return an error if a slot is not found
      rather than deadlocking in the event we ever get to this spot.
      
      Fixes #7779
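      
      A sketch in Go of the corrected ordering (hypothetical types; the
      runner does use golang.org/x/sync/semaphore):
      
        package sketch
        
        import (
            "context"
            "errors"
        
            "golang.org/x/sync/semaphore"
        )
        
        var errNoSlot = errors.New("no available sequence slot") // hypothetical
        
        // handle clears the sequence's internal data structures before
        // releasing the semaphore, and errors out rather than waiting
        // forever if no slot is found.
        func handle(ctx context.Context, sem *semaphore.Weighted,
            findSlot func() (int, error), run, clear func(slot int)) error {
            if err := sem.Acquire(ctx, 1); err != nil {
                return err
            }
            slot, err := findSlot()
            if err != nil {
                sem.Release(1)
                return errNoSlot // previously: silent deadlock
            }
            run(slot)      // issues the final response
            clear(slot)    // free internal structures first...
            sem.Release(1) // ...and only then admit the next request
            return nil
        }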
  27. 22 Nov, 2024 1 commit
    • logs: explain client aborts better (#7783) · b85520bf
      Daniel Hiltgen authored
      Users get confused by "Failed to acquire semaphore" error="context canceled"
      messages in the logs, which are actually clients giving up.  While there could be
      a legitimate hang bug in the system, sometimes this is just short client timeouts
      with an overloaded system, so this should help users understand what's going on
      better.
  28. 20 Nov, 2024 5 commits
    • runner.go: Truncate inputs that exceed context rather than shifting · c4b34f2a
      Jesse Gross authored
      Previous versions of the runner would truncate inputs to the context
      window before beginning processing. The main processing loop relied
      on this behavior if the context needed to be shifted later (due to
      token generation). If truncation did not occur then invariants
      would be broken, causing crashes or infinite loops.
      
      Later versions attempted to fix these bugs and make the logic less
      subtle so that all inputs could be handled. Truncation was removed
      to make things consistent.
      
      However, truncation is much faster than processing and shifting, so
      removing it caused performance problems when the input vastly exceeded
      the context size. This restores the input truncation as a performance
      optimization while keeping the more robust processing logic.
      
      Fixes #7762
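      
      A sketch in Go of the restored fast path (numKeep and the layout
      are illustrative, not the actual runner code):
      
        package sketch
        
        // truncateToContext shortens an over-long input to the context
        // window before processing, keeping the first numKeep tokens plus
        // the most recent remainder - roughly what repeated shifting would
        // eventually leave behind, but without the decode cost.
        func truncateToContext(inputs []int32, numCtx, numKeep int) []int32 {
            if len(inputs) <= numCtx {
                return inputs
            }
            discard := len(inputs) - numCtx
            out := make([]int32, 0, numCtx)
            out = append(out, inputs[:numKeep]...)
            out = append(out, inputs[numKeep+discard:]...)
            return out
        }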
    • runner.go: Don't add inputs to cache view until actually processed · c3ff9164
      Jesse Gross authored
      We need to track which tokens are in the cache ourselves. We currently
      add tokens to the cache tracker when we add them to batch but they are
      not actually in the cache until we call Decode. This can cause
      confusion when we are shifting the cache.
      
      Avoids "could not find a KV slot for the batch" issues.
      
      Bug #7545
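      
      A sketch of the ordering in Go (hypothetical types, not the actual
      cache tracker):
      
        package sketch
        
        // A cache slot tracks which inputs are logically in the KV cache.
        type cacheSlot struct{ inputs []int32 }
        
        // decodeBatch records pending tokens in the cache view only after
        // Decode succeeds; until then they are not actually in the cache.
        func decodeBatch(slot *cacheSlot, pending []int32, decode func([]int32) error) error {
            if err := decode(pending); err != nil {
                return err // tokens never reached the cache; don't track them
            }
            slot.inputs = append(slot.inputs, pending...)
            return nil
        }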
    • runner.go: Hard fail on errors rather than potentially infinite looping · 3fc1dc0e
      Jesse Gross authored
      We try to recover from errors by dropping the tokens that caused the
      problem and re-trying. However, dropping the tokens is not correct
      and continuing often leads to infinite loops. To avoid this, we
      end the sequence if such a condition is detected, which is also
      surprising.
      
      At this point, it is better to just report the error. This will make
      it easier to find problems, and the alternatives are perhaps even more
      surprising to users.
      
      This is not a very satisfactory solution either - we should isolate
      the error and return it to the user without killing the whole process.
      However, this is an incremental step and consistent with most other
      failures (which either manifest as abort() or panic).
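      
      A minimal sketch in Go of the hard-fail behavior (hypothetical
      wrapper, not the actual runner code):
      
        package sketch
        
        import "log"
        
        // decodeOrDie reports a decode failure and stops the runner
        // instead of dropping tokens and retrying, which could loop
        // forever.
        func decodeOrDie(decode func() error) {
            if err := decode(); err != nil {
                log.Fatalf("failed to decode batch: %v", err) // hard fail
            }
        }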
    • runner.go: Retry decoding after defragmentation if needed · 7121dfa3
      Jesse Gross authored
      Fragmentation of the KV cache can occur due to cache shifting or
      different sequences getting processed. Decode uses a heuristic to
      decide if it should defrag. However, this heuristic isn't 100%
      accurate, so decoding can sometimes fail by surprise.
      
      For these cases, if decode indicates that there is no KV cache space,
      we should defrag and then try again.
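      
      A sketch of the retry in Go (the sentinel error and function shapes
      are hypothetical):
      
        package sketch
        
        import "errors"
        
        var errNoKVSlot = errors.New("could not find a KV slot for the batch") // hypothetical sentinel
        
        // decodeRetry defragments and retries once if decode reports that
        // the KV cache has no space, covering cases the heuristic missed.
        func decodeRetry(decode func() error, defrag func()) error {
            err := decode()
            if errors.Is(err, errNoKVSlot) {
                defrag()
                err = decode()
            }
            return err
        }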
    • runner.go: Use correct index when retrieving embedding results · 5f68fcab
      Jesse Gross authored
      This doesn't have any impact currently because NUM_PARALLEL is forced
      to 1 for embeddings, so both indices will always be 0.
  29. 15 Nov, 2024 1 commit
    • runner.go: Propagate panics back to the user. · d875e99e
      Jesse Gross authored
      This is a partial revert of 8a35bb92
      "runner.go: Increase survivability of main processing loop", removing
      the panic handler.
      
      Although we want to avoid errors taking down the runner, we should
      also make the user aware of problems when they happen. In the
      future, we can restructure things so both parts are true.