1. 17 Mar, 2025 1 commit
  2. 14 Mar, 2025 3 commits
    • ollamarunner: Use a separate context per multimodal input · 282bfaaa
      Jesse Gross authored
      Currently there is a single context per sequence, shared by
      all multimodal inputs. Since we build a vision encoder graph per
      image, with a large number of inputs we can eventually hit the
      maximum number of graph nodes per context.
      
      This change uses a separate context for each image, ensuring
      that available resource limits are consistent.
      282bfaaa
    • ml: Allow models to constrain inputs to a single batch · 9679f401
      Jesse Gross authored
      Models may require that a set of inputs all be processed as part
      of the same batch. For example, if an image has multiple patches
      with fully connected attention between them, we should not split
      the batch in the middle of an image.
      
      Fixes #9697
      9679f401
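The batching constraint described in this commit can be illustrated with a minimal sketch. The `Input` type, its `SameBatch` field, and `splitBatches` are hypothetical names chosen for illustration, not necessarily the ones used in the commit. The sketch assumes each input can declare how many following inputs must share its batch, and that a new batch is started early rather than splitting such a group.

```go
package main

import "fmt"

// Input is a hypothetical stand-in for one element of the input stream.
// SameBatch is the number of following inputs that must be processed in the
// same batch as this one (0 means no constraint).
type Input struct {
	Token     int
	SameBatch int
}

// splitBatches packs inputs into batches of at most batchSize. When a
// constrained group does not fit in the current batch, the batch is closed
// early so the group is never split. SameBatch values on inputs inside a
// group are ignored in this sketch.
func splitBatches(inputs []Input, batchSize int) ([][]Input, error) {
	var batches [][]Input
	var cur []Input
	for i := 0; i < len(inputs); {
		group := 1 + inputs[i].SameBatch
		if i+group > len(inputs) {
			group = len(inputs) - i
		}
		if group > batchSize {
			return nil, fmt.Errorf("group of %d inputs exceeds batch size %d", group, batchSize)
		}
		if len(cur)+group > batchSize {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i:i+group]...)
		i += group
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches, nil
}

func main() {
	// Token 3 stands for the first patch of an image with two more patches.
	inputs := []Input{{Token: 1}, {Token: 2}, {Token: 3, SameBatch: 2}, {Token: 4}, {Token: 5}, {Token: 6}}
	batches, err := splitBatches(inputs, 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, b := range batches {
		fmt.Printf("batch %d: %d inputs\n", i, len(b))
	}
}
```

In this sketch a group larger than the batch size is reported as an error, which is one possible policy; the commit message does not say how that case is handled.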
    • llm: remove internal subprocess req and resp types (#9324) · 3892c3a7
      Bruce MacDonald authored
      This commit refactors the LLM subsystem by removing internal subprocess
      request and response types. It consolidates duplicate type definitions
      across the codebase, moving them to centralized locations. The change also
      standardizes interfaces between components, simplifies the ServerStatusResp
      struct, and moves the ParseDurationMs function to a common package. This
      cleanup reduces code duplication between different runner implementations
      (llamarunner and ollamarunner).
      3892c3a7
  3. 13 Mar, 2025 1 commit
  4. 11 Mar, 2025 2 commits
  5. 10 Mar, 2025 2 commits
    • Jeffrey Morgan's avatar
    • model: Update encoder cache to use multimodal input processing handler · a1cda80b
      Jesse Gross authored
      The encoder cache needs to know the position of images in the input
      stream so that it knows when to delete them. Previously images didn't
      have a position, so we implied one by breaking batches before an
      image and then assuming the image was in the first position. However,
      multimodal objects are now given explicit positions in the input
      stream, so we can use that instead.
      
      Breaking batches was also a way to simulate a cross attention mask
      for mllama. However, given that it only supports a single sequence
      and a single image, this mask doesn't serve any real purpose.
      Removing the batch break does not appear to affect the quality of
      the output.
      
      Most of this is simply moving the input data structures to a new
      package to avoid import cycles.
      a1cda80b
  6. 09 Mar, 2025 1 commit
  7. 08 Mar, 2025 1 commit
  8. 07 Mar, 2025 3 commits
    • sample: improve ollama engine sampler performance (#9374) · 0682dae0
      Parth Sareen authored
      This change brings in various interface cleanups along with greatly improving the performance of the sampler.
      
      Tested with llama3.2 on local machine.
      Improves performance from ~70 tokens/s to 135 tokens/s with topK(40) enabled.
      Without topK, performance is ~110 tokens/s.
      0682dae0
    • ollamarunner: Improve multimodal input handling · a7e63b82
      Jesse Gross authored
      Various vision models have different requirements for how they
      receive their inputs. For example:
       - Mllama wants images together with text and the image embeddings
         don't themselves have positions or get stored in the main KV cache
       - Llava-style models feed in embeddings similar to tokens and
         images correspond to a varying number of tokens in the cache.
      
      In addition, the strategy for providing inputs must support batching
      and multiple sequences, which are managed by the runner. At the same
      time, we want to keep data handling fully in the model so that new
      architectures are not bottlenecked by runner code which does not
      understand their particular requirements.
      
      This provides a method for models to edit the input stream so that
      it meets their needs while still being in a format that the runner
      understands. This allows the runner to avoid special processing
      for different models.
      
      In addition, this fixes a regression where non-vision models may
      try to incorrectly interpret images.
      a7e63b82
    • model: Don't unconditionally add special tokens · b70fc4d5
      Jesse Gross authored
      We sometimes tokenize partial strings. For example, with
      multimodal inputs, we split the input string around the images
      and then tokenize each piece. In these cases, we should only add
      the special tokens on the first piece.
      b70fc4d5
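A minimal sketch of the rule in this commit, using a toy whitespace tokenizer. The `encode` and `encodePieces` functions and the `<s>` marker are hypothetical and only stand in for a real tokenizer and its beginning-of-sequence token.

```go
package main

import (
	"fmt"
	"strings"
)

// bos is a hypothetical beginning-of-sequence marker.
const bos = "<s>"

// encode is a toy tokenizer: it splits on whitespace and, if addSpecial is
// set, prepends the special token.
func encode(s string, addSpecial bool) []string {
	var toks []string
	if addSpecial {
		toks = append(toks, bos)
	}
	return append(toks, strings.Fields(s)...)
}

// encodePieces tokenizes each piece of a prompt that was split around
// images, adding special tokens only when encoding the first piece.
func encodePieces(pieces []string) []string {
	var out []string
	for i, p := range pieces {
		out = append(out, encode(p, i == 0)...)
	}
	return out
}

func main() {
	// The prompt was split around one image, giving two text pieces.
	fmt.Println(encodePieces([]string{"describe", "in detail"}))
}
```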
  9. 04 Mar, 2025 1 commit
    • ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend. this ensures
        this information is always present without needing to be called
        explicitly
      - convert to structured logging
      - enumerate devices rather than backends since devices are ordered
      - track device indices grouped by device name
      05a01fde
  10. 02 Mar, 2025 1 commit
    • ml: Enable support for flash attention · 21aa666a
      Jesse Gross authored
      The GGML flash attention kernel has specific requirements for
      padding and permutation. This adds support to the KV cache
      for conforming to these requirements so that flash attention
      can be enabled.
      
      Flash attention can be used in the same situations as the llama
      engine and is enabled by the user in the same way.
      21aa666a
  11. 28 Feb, 2025 2 commits
  12. 27 Feb, 2025 1 commit
  13. 25 Feb, 2025 1 commit
  14. 20 Feb, 2025 1 commit
    • ollamarunner: Pass runner performance parameters to backends · bd6a7d5e
      Jesse Gross authored
      Currently the following parameters are in the runner but not used:
       - numGPULayers
       - mainGPU
       - threads
       - tensorSplit
      
      This passes them through to the backend, which is where they would
      actually get used. However, the GGML backend does not yet do anything
      with them.
      bd6a7d5e
  15. 14 Feb, 2025 2 commits
    • Daniel Hiltgen's avatar
      df2680b4
    • Runner for Ollama engine · ed443a03
      Jesse Gross authored
      This provides integration with the new Ollama engine
      (58245413 next ollama runner (#7913)) and the rest of the Ollama
      infrastructure such as the runner and Ollama server.
      
      In addition, it also builds out the KV cache infrastructure to
      support requirements of how Ollama runs models such as:
       - Parallel processing
       - Memory management for defragmentation and shifting
       - Multi-modal models
      
      Both old and new engines continue to be supported. By default, only
      the old engine is used. To enable the new engine:
      
      Start the server with the OLLAMA_NEW_ENGINE environment variable set:
      OLLAMA_NEW_ENGINE=1 ./ollama serve
      
      Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
      ./ollama run jessegross/llama3.1
      ed443a03
  16. 08 Jan, 2025 1 commit
  17. 17 Dec, 2024 1 commit
    • llama: Ensure KV cache is fully defragmented. · 08a832b4
      Jesse Gross authored
      Sometimes the KV cache requires defragmentation even without
      triggering the threshold heuristic. In this case, decoding
      will not be able to find a KV cache slot. This is particularly
      difficult for the caller to handle if it happens in between
      ubatches. To avoid this, we should immediately trigger a defrag.
      
      In addition, a heavily fragmented cache can require more than
      max_moves to defragment. Currently, we stop when we hit the limit
      but this can leave a cache that still does not have adequate space
      even after defragmentation is triggered. Instead, we should do
      multiple batches of processing until everything is complete.
      
      Fixes #7949
      08a832b4
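The second point of this commit, repeating bounded defragmentation passes until nothing is left to move, can be sketched on a toy cache. This is an illustration only: the cache is modeled as a slice where 0 marks an empty cell, and `defragPass` and `defragFully` are hypothetical names, not the functions changed by the commit.

```go
package main

import "fmt"

// defragPass moves at most maxMoves occupied cells into earlier empty cells
// and reports how many cells it moved. A cell value of 0 means "empty".
func defragPass(cells []int, maxMoves int) int {
	moves := 0
	free := 0
	for i := 0; i < len(cells) && moves < maxMoves; i++ {
		if cells[i] == 0 {
			continue
		}
		// Advance free to the first empty cell before position i, if any.
		for free < i && cells[free] != 0 {
			free++
		}
		if free < i {
			cells[free], cells[i] = cells[i], 0
			moves++
		}
	}
	return moves
}

// defragFully repeats bounded passes until a pass moves nothing, instead of
// stopping after the first pass hits the move limit. It returns the number
// of passes that moved at least one cell. It assumes maxMoves > 0.
func defragFully(cells []int, maxMoves int) int {
	passes := 0
	for defragPass(cells, maxMoves) > 0 {
		passes++
	}
	return passes
}

func main() {
	cells := []int{1, 0, 2, 0, 3, 0, 4}
	passes := defragFully(cells, 2)
	fmt.Println(cells, passes)
}
```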
  18. 11 Dec, 2024 1 commit
  19. 10 Dec, 2024 1 commit
    • build: Make target improvements (#7499) · 4879a234
      Daniel Hiltgen authored
      * llama: wire up builtin runner
      
      This adds a new entrypoint into the ollama CLI to run the cgo built runner.
      On Mac arm64, this will have GPU support, but on all other platforms it will
      be the lowest common denominator CPU build.  After we fully transition
      to the new Go runners more tech-debt can be removed and we can stop building
      the "default" runner via make and rely on the builtin always.
      
      * build: Make target improvements
      
      Add a few new targets and help for building locally.
      This also adjusts the runner lookup to favor local builds, then
      runners relative to the executable, and finally payloads.
      
      * Support customized CPU flags for runners
      
      This implements a simplified custom CPU flags pattern for the runners.
      When built without overrides, the runner name contains the vector flag
      we check for (AVX) to ensure we don't try to run on unsupported systems
      and crash.  If the user builds a customized set, we omit the naming
      scheme and don't check for compatibility.  This avoids checking
      requirements at runtime, so that logic has been removed as well.  This
      can be used to build GPU runners with no vector flags, or CPU/GPU
      runners with additional flags (e.g. AVX512) enabled.
      
      * Use relative paths
      
      If the user checks out the repo in a path that contains spaces, make gets
      really confused so use relative paths for everything in-repo to avoid breakage.
      
      * Remove payloads from main binary
      
      * install: clean up prior libraries
      
      This removes support for v0.3.6 and older versions (before the tar bundle)
      and ensures we clean up prior libraries before extracting the bundle(s).
      Without this change, runners and dependent libraries could leak when we
      update and lead to subtle runtime errors.
      4879a234
  20. 03 Dec, 2024 1 commit
  21. 27 Nov, 2024 1 commit
  22. 26 Nov, 2024 2 commits
    • runner.go: Don't try to extract image tags for text models · 71e6a0d0
      Jesse Gross authored
      When processing a prompt, we look for image tags of the form
      [img-0], which are inserted by the Ollama server process.
      However, this can cause errors if the original prompt has these
      tags - typically an image not found error is returned.
      
      This changes tag searching behavior to be similar to the 0.3.x
      series, which will largely avoid these problems. However, they can
      still happen when input text with these tags is used with image
      models. The correct solution is to escape the tags but this is a
      larger issue with special sequences in general so this is an
      incremental fix that should avoid the problem for the majority
      of cases.
      71e6a0d0
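As a rough illustration of the behavior named in the commit title, the sketch below only searches for `[img-N]` tags when the model accepts images. The `splitPrompt` function and its `hasVision` flag are hypothetical, and the regular expression is an assumption based on the tag form quoted in the message. As the message notes, prompts that contain literal tags can still be misread when used with image models.

```go
package main

import (
	"fmt"
	"regexp"
)

// imgTag matches tags of the form [img-0], [img-1], ...
var imgTag = regexp.MustCompile(`\[img-(\d+)\]`)

// splitPrompt splits a prompt around image tags. For text-only models the
// prompt is returned untouched, so literal "[img-0]" text in the prompt is
// not treated as a reference to an image.
func splitPrompt(prompt string, hasVision bool) []string {
	if !hasVision {
		return []string{prompt}
	}
	return imgTag.Split(prompt, -1)
}

func main() {
	prompt := "what is in [img-0] exactly?"
	fmt.Printf("%q\n", splitPrompt(prompt, false))
	fmt.Printf("%q\n", splitPrompt(prompt, true))
}
```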
    • runner.go: Add unit tests for context shifting · 2cd11ae3
      Jesse Gross authored
      This also makes it easier to truncate long inputs the same as
      shifting but does not actually implement it. This type of
      truncation has a trade off between quality and time to first
      token.
      2cd11ae3
  23. 23 Nov, 2024 1 commit
    • runner.go: Fix deadlock with many concurrent requests · 3478b2cf
      Jesse Gross authored
      If there are no available slots for new sequences then a request
      will not be added to the processing queue but will continue on
      to wait for a response that never comes. Besides never giving a
      response to the request, this prevents the model from being
      unloaded due to the outstanding request.
      
      To prevent this, there are semaphores that prevent more requests
      from being processed than there are slots - one in the Ollama
      server and one in the runner.
       - The Ollama server one works but it is not designed to protect
      the runner's internal data structures, and the runner can return a
      final response before clearing its data structures.
       - The internal runner semaphore has similar behavior where it
       can release the semaphore when it issues a response. This is
       wrong - it should only release the semaphore after it has
       cleared the data structure.
      
      In addition, we should return an error if a slot is not found
      rather than deadlocking in the event we ever get to this spot.
      
      Fixes #7779
      3478b2cf
  24. 22 Nov, 2024 1 commit
    • logs: explain client aborts better (#7783) · b85520bf
      Daniel Hiltgen authored
      Users get confused by "Failed to acquire semaphore" error="context canceled"
      messages in the logs, which are actually clients giving up.  While there could be
      a legitimate hang bug in the system, sometimes this is just short client timeouts
      with an overloaded system, so this should help users understand what's going on
      better.
      b85520bf
  25. 20 Nov, 2024 5 commits
    • runner.go: Truncate inputs that exceed context rather than shifting · c4b34f2a
      Jesse Gross authored
      Previous versions of the runner would truncate inputs to the context
      window before beginning processing. The main processing loop relied
      on this behavior if the context needed to be shifted later (due to
      token generation). If truncation did not occur then invariants
      would be broken, causing crashes or infinite loops.
      
      Later versions attempted to fix these bugs and make the logic less
      subtle so that all inputs could be handled. Truncation was removed
      to make things consistent.
      
      However, truncation is much faster than processing and shifting, so
      removing it caused performance problems when the input vastly exceeded
      the context size. This restores the input truncation as a performance
      optimization while keeping the more robust processing logic.
      
      Fixes #7762
      c4b34f2a
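A minimal sketch of truncating an over-long input before processing. The message does not say which tokens are kept, so the policy here (keep the first `numKeep` inputs plus the most recent ones) is an assumption, as are the names `truncate`, `numCtx`, and `numKeep`.

```go
package main

import "fmt"

// truncate shortens inputs to at most numCtx entries. It keeps the first
// numKeep inputs and the most recent ones, discarding the middle. This
// policy is an assumption made for illustration.
func truncate(inputs []int, numCtx, numKeep int) []int {
	if len(inputs) <= numCtx {
		return inputs
	}
	if numKeep < 0 {
		numKeep = 0
	}
	if numKeep > numCtx {
		numKeep = numCtx
	}
	discard := len(inputs) - numCtx
	out := make([]int, 0, numCtx)
	out = append(out, inputs[:numKeep]...)
	out = append(out, inputs[numKeep+discard:]...)
	return out
}

func main() {
	inputs := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	fmt.Println(truncate(inputs, 6, 2))
}
```

Dropping the excess up front is a single slice operation, whereas processing everything and shifting the cache repeatedly does work for tokens that end up discarded anyway, which matches the performance argument in the message.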
    • runner.go: Don't add inputs to cache view until actually processed · c3ff9164
      Jesse Gross authored
      We need to track which tokens are in the cache ourselves. We currently
      add tokens to the cache tracker when we add them to batch but they are
      not actually in the cache until we call Decode. This can cause
      confusion when we are shifting the cache.
      
      Avoids "could not find a KV slot for the batch" issues.
      
      Bug #7545
      c3ff9164
    • runner.go: Hard fail on errors rather than potentially infinite looping · 3fc1dc0e
      Jesse Gross authored
      We try to recover from errors by dropping the tokens that caused the
      problem and re-trying. However, dropping the tokens is not correct
      and continuing often leads to infinite loops. To avoid this, we
      end the sequence if such a condition is detected, which is also
      surprising.
      
      At this point, it is better to just report the error. This will make
      it easier to find problems and the alternatives are perhaps even more
      surprising to users.
      
      This is not a very satisfactory solution either - we should isolate
      the error and return it to the user without killing the whole process.
      However, this is an incremental step and consistent with most other
      failures (which either manifest as abort() or panic).
      3fc1dc0e
    • runner.go: Retry decoding after defragmentation if needed · 7121dfa3
      Jesse Gross authored
      Fragmentation of the KV cache can occur due to cache shifting or
      different sequences getting processed. Decode uses a heuristic to
      decide if it should defrag. However, this heuristic isn't 100%
      accurate, so decoding can sometimes fail by surprise.
      
      For these cases, if decode indicates that there is no KV cache space,
      we should defrag and then try again.
      7121dfa3
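The retry described in this commit can be sketched against a fake cache. The `fakeCache` type, `errCacheFull`, and `decodeWithRetry` are hypothetical names used only to show the control flow: decode, and if the failure is specifically "no cache space", defragment once and decode again.

```go
package main

import (
	"errors"
	"fmt"
)

// errCacheFull is a hypothetical sentinel for "no KV cache slot available".
var errCacheFull = errors.New("no KV cache slot for batch")

// fakeCache simulates a cache that fails to decode while fragmented.
type fakeCache struct {
	fragmented bool
	defrags    int
}

func (c *fakeCache) Decode(batch []int) error {
	if c.fragmented {
		return errCacheFull
	}
	return nil
}

func (c *fakeCache) Defrag() {
	c.fragmented = false
	c.defrags++
}

// decodeWithRetry retries once after defragmenting, but only when the first
// attempt failed for lack of cache space. Other errors are returned as is.
func decodeWithRetry(c *fakeCache, batch []int) error {
	err := c.Decode(batch)
	if errors.Is(err, errCacheFull) {
		c.Defrag()
		err = c.Decode(batch)
	}
	return err
}

func main() {
	c := &fakeCache{fragmented: true}
	err := decodeWithRetry(c, []int{1, 2, 3})
	fmt.Println("error:", err, "defrags:", c.defrags)
}
```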
    • runner.go: Use correct index when retrieving embedding results · 5f68fcab
      Jesse Gross authored
      This doesn't have any impact currently because NUM_PARALLEL is forced
      to 1 for embeddings, so both indices will always be 0.
      5f68fcab
  26. 15 Nov, 2024 2 commits
    • runner.go: Propagate panics back to the user. · d875e99e
      Jesse Gross authored
      This is a partial revert of 8a35bb92
      "runner.go: Increase survivability of main processing loop", removing
      the panic handler.
      
      Although we want to avoid errors taking down the runner, we also
      should make the user aware of problems when they happen. In the
      future, we can restructure things so both parts are true.
      d875e99e
    • runner.go: Increase survivability of main processing loop · 8a35bb92
      Jesse Gross authored
      Currently, if an error occurs during the prep stages (such as
      tokenizing) of a single request, it will only affect that request.
      However, if an error happens during decoding, it can take down the
      entire runner.
      
      Instead, it's better to drop the tokens that triggered the error and try to
      keep going. However, we also need to stop when we run out of tokens,
      otherwise, this just causes an infinite loop. This is likely the cause
      of at least some of the hanging issues that have been reported.
      
      Bug #7573
      8a35bb92