"vscode:/vscode.git/clone" did not exist on "a7248f6ea8fc277b81916dffb238cdcb1f0d9c58"
  1. 29 Aug, 2025 1 commit
    • perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy
      
      This refactors the main run loop of the ollama runner to perform the main
      GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the
      next batch in parallel, reducing the amount of time the GPU stalls waiting
      for the next batch of work.
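
      A minimal sketch of the overlap pattern described above, assuming hypothetical
      prepareBatch and compute helpers (names and structure are illustrative, not the
      runner's actual code):

        package main

        import "fmt"

        type batch struct{ id int }

        // prepareBatch stands in for the CPU-side work of building the next batch.
        func prepareBatch(id int) batch { return batch{id: id} }

        // compute stands in for the GPU-intensive Compute+Floats step.
        func compute(b batch) { fmt.Println("computed batch", b.id) }

        func main() {
            next := prepareBatch(0)
            for i := 1; i <= 3; i++ {
                cur := next
                done := make(chan struct{})
                go func() {
                    compute(cur) // GPU works on the current batch...
                    close(done)
                }()
                next = prepareBatch(i) // ...while the next batch is prepared in parallel
                <-done                 // wait for the GPU before submitting the next batch
            }
            compute(next)
        }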
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
      517807cd
  2. 22 Aug, 2025 1 commit
  3. 08 Aug, 2025 1 commit
    • ggml: Support closing backends · 756c78cf
      Jesse Gross authored
      In order to iteratively find the best memory allocation, we need to
      be able to free backend memory so we can try again.
      756c78cf
  4. 15 May, 2025 1 commit
    • ollamarunner: Base cached tokens on current prompt · 499ae731
      Jesse Gross authored
      When we restore a sequence from the cache, we split the prompt into
      the already used tokens (stored in the cache) and new tokens that
      need to be processed. Currently, the references to the used tokens
      are coming from the stored previous sequence.
      
      However, even though we know that the used tokens are semantically
      equivalent to the prefix of the prompt, tokens can contain pointers
      which are no longer valid. As a result, it is better to get the
      used tokens from the prompt, which has currently valid pointers.
      
      This doesn't currently have any impact because it isn't possible
      to reuse the pointers (which are tensors) anyways. However, it
      becomes an issue once we can.
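
      A rough sketch of the idea, using stand-in types (input, cacheSlot) rather than
      the runner's real ones:

        package sketch

        // input stands in for a token or multimodal input that may hold pointers.
        type input struct{ token int32 }

        // cacheSlot stands in for a restored cache entry.
        type cacheSlot struct{ inputs []input }

        // resume splits the prompt into the prefix already stored in the cache and
        // the suffix that still needs processing. The used prefix is taken from the
        // current prompt (valid pointers) rather than from the stored sequence.
        func resume(slot *cacheSlot, prompt []input) (remaining []input) {
            numUsed := len(slot.inputs)
            if numUsed > len(prompt) {
                numUsed = len(prompt)
            }
            slot.inputs = prompt[:numUsed] // previously: kept the stale stored inputs
            return prompt[numUsed:]
        }
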
      499ae731
  5. 08 May, 2025 1 commit
    • ollamarunner: Use correct constant to remove cache entries · 3d9498a4
      Jesse Gross authored
      The correct constant to remove all entries to the end of the sequence
      for the Ollama engine is math.MaxInt32. -1 is used by the old engine.
      
      The impact of this is currently minimal because it would only occur
      in situations that are not supported by the implemented models or
      rarely used options.
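
      A hedged illustration of the distinction; the Remove signature is a stand-in
      rather than the exact cache API:

        package sketch

        import "math"

        // kvCache is a stand-in for a cache with a Remove(seq, begin, end) call.
        type kvCache interface {
            Remove(seq int, beginIndex, endIndex int32) error
        }

        // clearToEnd removes everything from beginIndex to the end of the sequence.
        // For the Ollama engine the "to the end" sentinel is math.MaxInt32; -1 is
        // the convention used by the old engine and is not valid here.
        func clearToEnd(c kvCache, seq int, beginIndex int32) error {
            return c.Remove(seq, beginIndex, math.MaxInt32)
        }
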
      3d9498a4
  6. 02 Apr, 2025 2 commits
    • kvcache: Add check for values that fall out of sliding window cache · b4297006
      jmorganca authored
      
      
      The sliding window cache trims entries that are outside the window for
      the latest token. This works when we are extending the cache, such as
      when the conversation continues. However, if we have a partial overlap
      in conversation (including the BOS tokens), then we resume from a past
      point in the conversation and the needed tokens are no longer stored
      in memory. This verifies that the new window overlaps with the old one
      before reusing the cache.
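
      A simplified sketch of that overlap check, assuming a hypothetical slot type
      that records the range of positions still resident in the sliding window cache:

        package sketch

        // slot is an illustrative stand-in for a sliding window cache entry that
        // still holds positions [windowStart, windowEnd] in memory.
        type slot struct {
            windowStart int32
            windowEnd   int32
        }

        // canReuse reports whether resuming at resumePos can use the cached slot.
        // If the resume point falls before the oldest position still stored, the
        // needed tokens have already been trimmed and the cache must not be reused.
        func canReuse(s slot, resumePos int32) bool {
            return resumePos >= s.windowStart && resumePos <= s.windowEnd
        }
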
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      b4297006
    • ollamarunner: Don't truncate a SameBatch · 493385eb
      Jesse Gross authored
      When truncating inputs to the context window at the beginning of
      a sequence, we remove the minimum amount possible. However, this
      may cause us to truncate to the middle of a set of inputs that
      the model specified should not be split up. To avoid this, we
      need to remove the rest of the partial batch.
      493385eb
  7. 31 Mar, 2025 1 commit
    • runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear KV cache when shift operation is not supported by model.
      Added a KvCacheCanShift() check to handle models that can't perform cache shifts,
      falling back to a full cache clear while preserving logical token history to
      maintain expected behavior when the context window fills up.
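
      A condensed sketch of the fallback path; the cache interface here is
      illustrative, loosely mirroring the KvCacheCanShift() check mentioned above:

        package sketch

        // cache is a stand-in for the runner's KV cache handle.
        type cache interface {
            CanShift() bool
            Shift(discard int)
            Clear()
        }

        // makeRoom frees space when the context window fills up. Models that cannot
        // shift their KV cache fall back to a full clear; the logical token history
        // kept by the runner is preserved so callers see the expected behavior.
        func makeRoom(c cache, discard int) {
            if c.CanShift() {
                c.Shift(discard)
                return
            }
            c.Clear() // shift unsupported: drop the cache and re-prompt from history
        }
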
      66b25392
  8. 21 Mar, 2025 1 commit
    • kvcache: Pass granular cache size into implementations · 3ed7ad3a
      Jesse Gross authored
      Currently the runner computes the kv size needed and creates a
      cache of that size. This is the context size times number of
      parallel sequences.
      
      Cache implementations can make better decisions about their memory
      usage, so instead pass in the required capacity, number of sequences
      and maximum batch size. For now, the causal cache just uses this to
      compute the size in the same way as before.
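
      Roughly, the interface change looks like the following; the signature and the
      arithmetic are paraphrased from the description above, not copied from the code:

        package sketch

        // The cache now receives the granular inputs (per-sequence capacity, number
        // of sequences, maximum batch size) instead of a single precomputed total.
        type Cache interface {
            Init(capacity int32, maxSequences int, maxBatch int)
        }

        // causalCache keeps the old behavior: the total size is still the
        // per-sequence capacity times the number of parallel sequences.
        type causalCache struct{ cells int32 }

        func (c *causalCache) Init(capacity int32, maxSequences int, maxBatch int) {
            c.cells = capacity * int32(maxSequences)
        }
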
      3ed7ad3a
  9. 17 Mar, 2025 1 commit
  10. 14 Mar, 2025 1 commit
    • llm: remove internal subprocess req and resp types (#9324) · 3892c3a7
      Bruce MacDonald authored
      This commit refactors the LLM subsystem by removing internal subprocess
      request and response types. It consolidates duplicate type definitions
      across the codebase, moving them to centralized locations. The change also
      standardizes interfaces between components, simplifies the ServerStatusResp
      struct, and moves the ParseDurationMs function to a common package. This
      cleanup reduces code duplication between different runner implementations
      (llamarunner and ollamarunner).
      3892c3a7
  11. 10 Mar, 2025 1 commit
    • model: Update encoder cache to use multimodal input processing handler · a1cda80b
      Jesse Gross authored
      The encoder cache needs to know the position of images in the input
      stream so that it knows when to delete them. Previously images didn't
      have a position, so we implied one by breaking batches before an
      image and then assuming the image was in the first position. However,
      multimodal objects are now given explicit positions in the input
      stream, so we can use that instead.
      
      Breaking batches was also a way to simulate a cross attention mask
      for mllama. However, given that it only supports a single sequence
      and a single image, this mask doesn't serve any real purpose.
      Removing the batch break does not appear to affect the quality of
      the output.
      
      Most of this is simply moving the input data structures to a new
      package to avoid import cycles.
      a1cda80b
  12. 08 Mar, 2025 1 commit
  13. 07 Mar, 2025 1 commit
    • ollamarunner: Improve multimodal input handling · a7e63b82
      Jesse Gross authored
      Various vision models have different requirements for how they
      receive their inputs. For example:
       - Mllama wants images together with text and the image embeddings
         don't themselves have positions or get stored in the main KV cache
       - Llava-style models feed in embeddings similar to tokens and
         images correspond to a varying number of tokens in the cache.
      
      In addition, the strategy for providing inputs must support batching
      and multiple sequences, which are managed by the runner. At the same
      time, we want to keep data handling fully in the model so that new
      architectures are not bottlenecked by runner code which does not
      understand their particular requirements.
      
      This provides a method for models to edit the input stream so that
      it meets their needs while still being in a format that the runner
      understands. This allows the runner to avoid special processing
      for different models.
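
      A rough, illustrative paraphrase of that mechanism: the model gets a chance to
      rewrite the flat input stream before batching. The Input and InputProcessor
      names below are assumptions for this sketch, not the exact types in the model
      package:

        package sketch

        // Input is a stand-in for one element of the input stream: either a text
        // token or an opaque multimodal object (e.g. an image embedding).
        type Input struct {
            Token      int32
            Multimodal any
        }

        // InputProcessor is an illustrative version of the hook: the model receives
        // the tokenized inputs and returns an edited stream that still looks like a
        // flat sequence to the runner (which handles batching and sequences), while
        // encoding whatever structure the architecture needs.
        type InputProcessor interface {
            PostTokenize(inputs []Input) ([]Input, error)
        }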
      
      In addition, this fixes a regression where non-vision models may
      try to incorrectly interpret images.
      a7e63b82
  14. 14 Feb, 2025 1 commit
    • Runner for Ollama engine · ed443a03
      Jesse Gross authored
      This provides integration with the new Ollama engine
      (58245413 next ollama runner (#7913)) and the rest of the Ollama
      infrastructure such as the runner and Ollama server.
      
      In addition, it also builds out the KV cache infrastructure to
      support requirements of how Ollama runs models such as:
       - Parallel processing
       - Memory management for defragmentation and shifting
       - Multi-modal models
      
      Both old and new engines continue to be supported. By default, only
      the old engine is used. To enable the new engine:
      
      Start the server with the OLLAMA_NEW_ENGINE environment variable set:
      OLLAMA_NEW_ENGINE=1 ./ollama serve
      
      Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
      ./ollama run jessegross/llama3.1
      ed443a03
  15. 10 Dec, 2024 1 commit
    • build: Make target improvements (#7499) · 4879a234
      Daniel Hiltgen authored
      * llama: wire up builtin runner
      
      This adds a new entrypoint into the ollama CLI to run the cgo-built runner.
      On Mac arm64, this will have GPU support, but on all other platforms it will
      be the lowest common denominator CPU build.  After we fully transition
      to the new Go runners, more tech debt can be removed and we can stop building
      the "default" runner via make, relying on the builtin runner always.
      
      * build: Make target improvements
      
      Add a few new targets and help for building locally.
      This also adjusts the runner lookup to favor local builds, then
      runners relative to the executable, and finally payloads.
      
      * Support customized CPU flags for runners
      
      This implements a simplified custom CPU flags pattern for the runners.
      When built without overrides, the runner name contains the vector flag
      we check for (AVX) to ensure we don't try to run on unsupported systems
      and crash.  If the user builds a customized set, we omit the naming
      scheme and don't check for compatibility.  This avoids checking
      requirements at runtime, so that logic has been removed as well.  This
      can be used to build GPU runners with no vector flags, or CPU/GPU
      runners with additional flags (e.g. AVX512) enabled.
      
      * Use relative paths
      
      If the user checks out the repo in a path that contains spaces, make gets
      really confused, so use relative paths for everything in-repo to avoid breakage.
      
      * Remove payloads from main binary
      
      * install: clean up prior libraries
      
      This removes support for v0.3.6 and older versions (before the tar bundle)
      and ensures we clean up prior libraries before extracting the bundle(s).
      Without this change, runners and dependent libraries could leak when we
      update and lead to subtle runtime errors.
      4879a234
  16. 26 Nov, 2024 1 commit
    • runner.go: Add unit tests for context shifting · 2cd11ae3
      Jesse Gross authored
      This also makes it easier to truncate long inputs in the same way as
      shifting, though truncation itself is not implemented here. This type of
      truncation has a trade-off between quality and time to first
      token.
      2cd11ae3
  17. 20 Nov, 2024 1 commit
    • runner.go: Don't add inputs to cache view until actually processed · c3ff9164
      Jesse Gross authored
      We need to track which tokens are in the cache ourselves. We currently
      add tokens to the cache tracker when we add them to batch but they are
      not actually in the cache until we call Decode. This can cause
      confusion when we are shifting the cache.
      
      Avoids "could not find a KV slot for the batch" issues.
      
      Bug #7545
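
      The gist of the fix, sketched with stand-in types (seqCache and decode are
      assumptions for this illustration, not the runner's real names):

        package sketch

        import "errors"

        type token int32

        // seqCache tracks which tokens we believe are in the KV cache for a sequence.
        type seqCache struct{ inputs []token }

        // decode is a stand-in for the call that actually writes the batch into the
        // KV cache on the backend.
        func decode(batch []token) error {
            if len(batch) == 0 {
                return errors.New("empty batch")
            }
            return nil
        }

        // processBatch only extends the cache view after Decode succeeds, so that
        // shifting decisions are based on tokens that are really in the cache.
        func processBatch(c *seqCache, batch []token) error {
            if err := decode(batch); err != nil {
                return err // previously: the batch was already counted as cached here
            }
            c.inputs = append(c.inputs, batch...)
            return nil
        }
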
      c3ff9164
  18. 12 Nov, 2024 1 commit
    • runner.go: Make KV entry accounting more robust · 65973ceb
      Jesse Gross authored
      The structure of the accounting for KV cache shifting was carried
      over from the old runner but it now doesn't feel natural with the new
      runner. There are a number of invariants that should hold true but
      are difficult to reason about. There is at least one bug report
      that would imply that the invariants are not holding.
      
      This reduces the number of implicit assumptions and is more forgiving
      of unexpected situations. It also improves behavior around which input
      tokens are kept when truncation occurs.
      
      Bug #7545
      65973ceb
  19. 30 Oct, 2024 1 commit
    • runner.go: Better abstract vision model integration · c826e574
      Jesse Gross authored
      
      
      - Update mllama to take the cross attention state as embeddings in
        a batch, more similar to how Llava handles it. This improves
        integration with the input cache.
      - Pass locations in a prompt for embeddings using tags similar to Llava.
      - Abstract interface to vision models so the main runner accesses Clip
        and Mllama similarly.
      Co-authored-by: Michael Yang <mxyng@pm.me>
      c826e574
  20. 08 Oct, 2024 1 commit
    • Re-introduce the `llama` package (#5034) · 96efd905
      Jeffrey Morgan authored
      
      
      * Re-introduce the llama package
      
      This PR brings back the llama package, making it possible to call llama.cpp and
      ggml APIs from Go directly via CGo. This has a few advantages:
      
      - C APIs can be called directly from Go without needing to use the previous
        "server" REST API
      - On macOS and for CPU builds on Linux and Windows, Ollama can be built without
        a go generate ./... step, making it easy to get up and running to hack on
        parts of Ollama that don't require fast inference
      - Faster build times for AVX, AVX2, CUDA and ROCm (a full build of all runners
        takes <5 min on a fast CPU)
      - No git submodule, making it easier to clone and build from source
      
      This is a big PR, but much of it is vendor code except for:
      
      - llama.go CGo bindings
      - example/: a simple example of running inference
      - runner/: a subprocess server designed to replace the llm/ext_server package
      - Makefile: a Makefile kept as minimal as possible to build the runner package for
        different targets (cpu, avx, avx2, cuda, rocm)
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      
      * cache: Clear old KV cache entries when evicting a slot
      
      When forking a cache entry, if no empty slots are available we
      evict the least recently used one and copy over the KV entries
      from the closest match. However, this copy does not overwrite
      existing values but only adds new ones. Therefore, we need to
      clear the old slot first.
      
      This change fixes two issues:
       - The KV cache fills up and runs out of space even though we think
         we are managing it correctly
       - Performance gets worse over time as we use new cache entries that
         are not hot in the processor caches
      
      * doc: explain golang objc linker warning (#6830)
      
      * llama: gather transitive dependencies for rocm for dist packaging (#6848)
      
      * Refine go server makefiles to be more DRY (#6924)
      
      This breaks up the monolithic Makefile for the Go based runners into a
      set of utility files as well as recursive Makefiles for the runners.
      Files starting with the name "Makefile" are buildable, while files that
      end with ".make" are utilities to include in other Makefiles.  This
      reduces the amount of nearly identical targets and helps set a pattern
      for future community contributions for new GPU runner architectures.
      
      When we are ready to switch over to the Go runners, these files should
      move to the top of the repo, and we should add targets for the main CLI,
      as well as a helper "install" (put all the built binaries on the local
      system in a runnable state) and "dist" target (generate the various
      tar/zip files for distribution) for local developer use.
      
      * llama: don't create extraneous directories (#6988)
      
      * llama: Exercise the new build in CI (#6989)
      
      Wire up some basic sanity testing in CI for the Go runner.  GPU runners are not covered yet.
      
      * llama: Refine developer docs for Go server (#6842)
      
      This enhances the documentation for development focusing on the new Go
      server.  After we complete the transition further doc refinements
      can remove the "transition" discussion.
      
      * runner.go: Allocate batches for all sequences during init
      
      We should tell the model that we could have full batches for all
      sequences. We already do this when we allocate the batches but it was
      missed during initialization.
      
      * llama.go: Don't return nil from Tokenize on zero length input
      
      Potentially receiving nil in a non-error condition is surprising to
      most callers - it's better to return an empty slice.
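
      A small sketch of the convention; the Tokenize wrapper and tokenizeNative
      helper are illustrative, not the actual llama.go bindings:

        package sketch

        // tokenizeNative is a stand-in for the CGo call into the tokenizer.
        func tokenizeNative(content string) []int { return nil }

        // Tokenize returns an empty slice (not nil) for zero-length input, which is
        // less surprising to callers checking the result in a non-error path.
        func Tokenize(content string) ([]int, error) {
            if len(content) == 0 {
                return []int{}, nil
            }
            return tokenizeNative(content), nil
        }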
      
      * runner.go: Remove stop tokens from cache
      
      If the last token is EOG then we don't return this and it isn't
      present in the cache (because it was never submitted to Decode).
      This works well for extending the cache entry with a new sequence.
      
      However, for multi-token stop sequences, we won't return any of the
      tokens, but all except the last one will be in the cache. This means
      when the conversation continues the cache will contain tokens that
      don't overlap with the new prompt.
      
      This works (we will pick up the portion where there is overlap) but
      it causes unnecessary cache thrashing because we will fork the original
      cache entry as it is not a perfect match.
      
      By trimming the cache to the tokens that we actually return, this
      issue can be avoided.
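
      Schematically, the trim looks like this (the types are illustrative, not the
      runner's):

        package sketch

        type token int32

        type seq struct {
            cacheInputs []token // tokens believed to be in the KV cache
        }

        // truncateStop drops the stop-sequence tokens that were never returned to
        // the caller from the cache view, so the next turn's prompt lines up
        // exactly with what is cached and no fork is needed.
        func truncateStop(s *seq, numStopTokens int) {
            if numStopTokens <= 0 || numStopTokens > len(s.cacheInputs) {
                return
            }
            s.cacheInputs = s.cacheInputs[:len(s.cacheInputs)-numStopTokens]
        }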
      
      * runner.go: Simplify flushing of pending tokens
      
      * runner.go: Update TODOs
      
      * runner.go: Don't panic when processing sequences
      
      If there is an error processing a sequence, we should return a
      clean HTTP error back to Ollama rather than panicing. This will
      make us more resilient to transient failures.
      
      Panics can still occur during startup as there is no way to serve
      requests if that fails.
      Co-authored-by: jmorganca <jmorganca@gmail.com>
      
      * runner.go: More accurately capture timings
      
      Currently prompt processing time doesn't capture the time that it takes
      to tokenize the input, only decoding time. We should capture the
      full process to more accurately reflect reality. This is especially
      true once we start processing images where the initial processing
      can take significant time. This is also more consistent with the
      existing C++ runner.
      
      * runner.go: Support for vision models
      
      In addition to bringing feature parity with the C++ runner, this also
      incorporates several improvements:
       - Cache prompting works with images, avoiding the need to re-decode
         embeddings for every message in a conversation
       - Parallelism is supported, avoiding the need to restrict to one
         sequence at a time. (Though for now Ollama will not schedule
         parallel sequences while we might still need to fall back to the old runner.)
      Co-authored-by: jmorganca <jmorganca@gmail.com>
      
      * runner.go: Move Unicode checking code and add tests
      
      * runner.go: Export external cache members
      
      Runner and cache are in the same package so the change doesn't
      affect anything but it is more internally consistent.
      
      * runner.go: Image embedding cache
      
      Generating embeddings from images can take significant time (on
      my machine between 100ms and 8s depending on the model). Although
      we already cache the result of decoding these images, the embeddings
      need to be regenerated every time. This is not necessary if we get
      the same image over and over again, for example, during a conversation.
      
      This currently uses a very small cache with a very simple algorithm
      but it is easy to improve as is warranted.
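
      A toy version of the idea: a very small cache keyed by an image hash, with
      trivial eviction (the real cache's keying, size, and eviction may differ):

        package sketch

        import "crypto/sha256"

        type entry struct {
            key       [32]byte
            embedding []float32
        }

        // imageCache is a very small cache with a trivially simple eviction scheme:
        // new entries overwrite slots round-robin.
        type imageCache struct {
            entries [4]entry
            next    int
        }

        func (c *imageCache) get(image []byte, compute func([]byte) []float32) []float32 {
            key := sha256.Sum256(image)
            for _, e := range c.entries {
                if e.key == key && e.embedding != nil {
                    return e.embedding // hit: skip the 100ms-8s embedding step
                }
            }
            emb := compute(image)
            c.entries[c.next] = entry{key: key, embedding: emb}
            c.next = (c.next + 1) % len(c.entries)
            return emb
        }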
      
      * llama: catch up on patches
      
      Carry forward solar-pro and cli-unicode patches
      
      * runner.go: Don't re-allocate memory for every batch
      
      We can reuse memory allocated from batch to batch since batch
      size is fixed. This both saves the cost of reallocation and
      keeps the cache lines hot.
      
      This results in a roughly 1% performance improvement for token
      generation with Nvidia GPUs on Linux.
      
      * runner.go: Default to classic input cache policy
      
      The input cache as part of the go runner implemented a cache
      policy that aims to maximize hit rate in both single and multi-
      user scenarios. When there is a cache hit, the response is
      very fast.
      
      However, performance is actually slower when there is an input
      cache miss due to worse GPU VRAM locality. This means that
      performance is generally better overall for multi-user scenarios
      (better input cache hit rate; locality was relatively poor already)
      but worse for single users (input cache hit rate is about the same;
      locality is now worse).
      
      This defaults the policy back to the old one to avoid a regression
      but keeps the new one available through an environment variable
      OLLAMA_MULTIUSER_CACHE. This is left undocumented as the goal is
      to improve this in the future to get the best of both worlds
      without user configuration.
      
      For inputs that result in cache misses, on Nvidia/Linux this
      change improves performance by 31% for prompt processing and
      13% for token generation.
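
      In outline, the policy switch could look like this; OLLAMA_MULTIUSER_CACHE is
      the variable named above, while the surrounding function is illustrative:

        package sketch

        import "os"

        // pickCachePolicy defaults to the classic single-user-friendly policy and
        // only enables the hit-rate-optimized multi-user policy when the (for now
        // undocumented) environment variable is set.
        func pickCachePolicy() string {
            if os.Getenv("OLLAMA_MULTIUSER_CACHE") != "" {
                return "multiuser" // maximize cache hit rate across users
            }
            return "classic" // better VRAM locality on cache misses
        }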
      
      * runner.go: Increase size of response channel
      
      Generally the CPU can easily keep up with handling responses that
      are generated but there's no reason not to let generation continue
      and handle things in larger batches if needed.
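
      Illustratively, the change amounts to giving the per-sequence response channel
      a larger buffer so generation is not blocked on the consumer (the size shown is
      arbitrary, not the runner's value):

        package sketch

        // newResponses returns a buffered channel; generation can run ahead and the
        // HTTP side drains responses in larger batches when it falls behind.
        func newResponses() chan string {
            return make(chan string, 100)
        }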
      
      * llama: Add CI to verify all vendored changes have patches (#7066)
      
      Make sure we don't accidentally merge changes in the vendored code
      that aren't also reflected in the patches.
      
      * llama: adjust clip patch for mingw utf-16 (#7065)
      
      * llama: adjust clip patch for mingw utf-16
      
      * llama: ensure static linking of runtime libs
      
      Avoid runtime dependencies on non-standard libraries
      
      * runner.go: Enable llamafile (all platforms) and BLAS (Mac OS)
      
      These are two features that are shown on llama.cpp's system info
      that are currently different between the two runners. On my test
      systems the performance difference is very small to negligible
      but it is probably still good to equalize the features.
      
      * llm: Don't add BOS/EOS for tokenize requests
      
      This is consistent with what server.cpp currently does. It affects
      things like token processing counts for embedding requests.
      
      * runner.go: Don't cache prompts for embeddings
      
      Our integration with server.cpp implicitly disables prompt caching
      because it is not part of the JSON object being parsed; this makes
      the Go runner behave similarly.
      
      Prompt caching has been seen to affect the results of text completions
      on certain hardware. The results are not wrong either way but they
      are non-deterministic. However, embeddings seem to be affected even
      on hardware that does not show this behavior for completions. For
      now, it is best to maintain consistency with the existing behavior.
      
      * runner.go: Adjust debug log levels
      
      Add system info printed at startup and quiet down noisier logging.
      
      * llama: fix compiler flag differences (#7082)
      
      Adjust the flags for the new Go server to more closely match the
      generate flow
      
      * llama: refine developer docs (#7121)
      
      * llama: doc and example clean up (#7122)
      
      * llama: doc and example clean up
      
      * llama: Move new dockerfile into llama dir
      
      Temporary home until we fully transition to the Go server
      
      * llama: runner doc cleanup
      
      * llama.go: Add description for Tokenize error case
      
      ---------
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
      96efd905