1. 07 Mar, 2025 8 commits
    • kvcache: create cache ctx per layer · 764e199d
      Michael Yang authored
      each cache layer creates and maintains its own context instead of using
      a large context for all layers
    • model: load non-repeated tensors into multiple backends · bfce55db
      Michael Yang authored
      some tensors are expected to be used in repeating layers but are not
      themselves repeated. this change copies these tensors into the same
      backends as their repeating counterparts to minimize copying tensors
      between backends
    • ml/backend/ggml: update model loading for hybrid/multi backends · bab6f34d
      Michael Yang authored
      use a similar strategy as llama.cpp for deciding where tensors should be
      allocated. this will be improved later to be aware of usable memory
      before assigning the tensor
    • sample: improve ollama engine sampler performance (#9374) · 0682dae0
      Parth Sareen authored
      This change brings in various interface cleanups and greatly improves the performance of the sampler.

      Tested with llama3.2 on a local machine.
      Improves performance from ~70 tokens/s to ~135 tokens/s with topK(40) enabled.
      Without topK, performance is ~110 tokens/s.
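The topK step mentioned above can be sketched as follows. This is an illustrative stand-in, not the PR's implementation; a production sampler would use a partial selection (e.g. a heap) instead of a full sort to reach the reported throughput.

```go
package main

import (
	"fmt"
	"sort"
)

// topK returns the k largest logits in descending order. Illustrative
// only: real samplers avoid the full sort on the hot path.
func topK(logits []float64, k int) []float64 {
	out := append([]float64(nil), logits...)
	sort.Sort(sort.Reverse(sort.Float64Slice(out)))
	if k < len(out) {
		out = out[:k]
	}
	return out
}

func main() {
	fmt.Println(topK([]float64{0.1, 2.5, 1.3, 0.7}, 2)) // [2.5 1.3]
}
```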
    • Breaker · 1f6986e9
    • Jeffrey Morgan
    • ollamarunner: Improve multimodal input handling · a7e63b82
      Jesse Gross authored
      Various vision models have different requirements for how they
      receive their inputs. For example:
       - Mllama wants images together with text and the image embeddings
         don't themselves have positions or get stored in the main KV cache
       - Llava-style models feed in embeddings similar to tokens and
         images correspond to a varying number of tokens in the cache.
      
      In addition, the strategy for providing inputs must support batching
      and multiple sequences, which are managed by the runner. At the same
      time, we want to keep data handling fully in the model so that new
      architectures are not bottlenecked by runner code which does not
      understand their particular requirements.
      
      This provides a method for models to edit the input stream so that
      it meets their needs while still being in a format that the runner
      understands. This allows the runner to avoid special processing
      for different models.
      
      In addition, this fixes a regression where non-vision models may
      try to incorrectly interpret images.
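A minimal sketch of the idea: the runner hands the model its input stream, and the model rewrites it into the layout it needs. The type and method names below are illustrative, not the runner's actual API, and the fixed tokens-per-image expansion is a hypothetical Llava-style example.

```go
package main

import "fmt"

// Input is one element of the input stream: a text token, or
// multimodal data (e.g. an image embedding). Field names are ours.
type Input struct {
	Token      int
	Multimodal any // non-nil for image data
}

// inputProcessor lets a model edit the stream while the runner keeps
// batching and sequence management model-agnostic.
type inputProcessor interface {
	PostTokenize([]Input) ([]Input, error)
}

// llavaStyle is a hypothetical model that expands each image into a
// fixed number of placeholder inputs occupying cache positions.
type llavaStyle struct{ tokensPerImage int }

func (m llavaStyle) PostTokenize(in []Input) ([]Input, error) {
	var out []Input
	for _, inp := range in {
		if inp.Multimodal == nil {
			out = append(out, inp)
			continue
		}
		for i := 0; i < m.tokensPerImage; i++ {
			out = append(out, Input{Token: -1, Multimodal: inp.Multimodal})
		}
	}
	return out, nil
}

func main() {
	m := llavaStyle{tokensPerImage: 3}
	out, _ := m.PostTokenize([]Input{{Token: 1}, {Multimodal: "img"}, {Token: 2}})
	fmt.Println(len(out)) // 5: one token, three image placeholders, one token
}
```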
    • model: Don't unconditionally add special tokens · b70fc4d5
      Jesse Gross authored
      We sometimes tokenize partial strings. For example, with
      multimodal inputs, we split the input string around the images
      and then tokenize each piece. In these cases, we should only add
      the special tokens on the first piece.
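The splitting-and-tokenizing flow can be sketched like this. The tokenizer here is a toy stand-in (it splits on whitespace and prepends a BOS marker), not the real implementation; the point is the `i == 0` guard that adds special tokens only on the first piece.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a stand-in tokenizer: splits on whitespace and, when
// addSpecial is true, prepends a BOS marker.
func tokenize(s string, addSpecial bool) []string {
	toks := strings.Fields(s)
	if addSpecial {
		toks = append([]string{"<bos>"}, toks...)
	}
	return toks
}

// tokenizeAround tokenizes the text pieces surrounding images, adding
// special tokens only on the first piece.
func tokenizeAround(pieces []string) []string {
	var out []string
	for i, p := range pieces {
		out = append(out, tokenize(p, i == 0)...)
		if i < len(pieces)-1 {
			out = append(out, "<image>")
		}
	}
	return out
}

func main() {
	fmt.Println(tokenizeAround([]string{"describe", "this picture"}))
	// [<bos> describe <image> this picture]
}
```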
  2. 05 Mar, 2025 2 commits
    • server/internal/registry: take over pulls from server package (#9485) · e2252d0f
      Blake Mizerany authored
      This commit replaces the old pull implementation in the server package
      with the new, faster, more robust pull implementation in the registry
      package.
      
      The new endpoint, and now the remove endpoint too, are behind the
      "client2" feature gate, enabled only by setting the OLLAMA_EXPERIMENT
      environment variable to include "client2".
      
      Currently, the progress indication is wired to behave the same as the
      previous implementation to avoid making changes to the CLI. Because the
      status reports happen only at the start of the download and at the end
      of the write to disk, the progress indication is not as smooth as it
      could be. This is a known issue and will be addressed in a future change.
      
      This implementation may be ~0.5-1.0% slower in rare cases, depending on
      network and disk speed, but is generally much faster and more robust
      than its predecessor in all other cases.
    • Win: doc new rocm zip file (#9367) · cae5d4d4
      Daniel Hiltgen authored
      To stay under the 2 GB GitHub artifact limit, we're splitting ROCm
      out like we do on Linux.
  3. 04 Mar, 2025 6 commits
    • ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend. this ensures
        this information is always present without needing to be called
        explicitly
      - convert to structured logging
      - enumerate devices rather than backends since devices are ordered
      - track device indices grouped by device name
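The per-name index grouping can be sketched as below. Device names and the logging fields are illustrative, not the commit's actual output.

```go
package main

import (
	"fmt"
	"log/slog"
)

// nameIndices assigns each device an index within its name group, so
// e.g. two GPUs sharing a name get indices 0 and 1.
func nameIndices(devices []string) []int {
	seen := map[string]int{}
	out := make([]int, len(devices))
	for i, d := range devices {
		out[i] = seen[d]
		seen[d]++
	}
	return out
}

func main() {
	// Illustrative device list; real names come from backend enumeration.
	devices := []string{"NVIDIA A10", "NVIDIA A10", "CPU"}
	for i, idx := range nameIndices(devices) {
		// Structured logging, as the commit converts to.
		slog.Info("system", "device", devices[i], "index", i, "name_index", idx)
	}
	fmt.Println(nameIndices(devices)) // [0 1 0]
}
```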
    • docs: add granite-3.2 to the readme · 8fe6f69f
      aritra saha authored
    • New engine: vision models and auto-fallback (#9113) · 1fdb351c
      Daniel Hiltgen authored
      * Include unified vision layers in memory prediction
      
      For newer vision models with a single gguf, include
      the projection estimates.
      
      * Adjust CLI to handle both styles of vision model metadata
      
      * Wire up new tokenizers for new engine
      
      If we're loading the new engine, use the new model text processor
      instead of calling into cgo wrappers for llama.cpp. This also cleans
      up some tech debt from the older tokenization flow for the C++
      server, which was no longer used.
      
      This also adjusts the grammar handling logic to pass
      through to the new engine instead of utilizing the cgo
      schema to grammar call.
      
      * Lay foundation for auto selection of new engine
    • server/internal/registry: reintroduce pruning on model deletion (#9489) · 7a01ad76
      Blake Mizerany authored
      This reintroduces aggressive pruning on model deletion as a temporary
      measure until a more controlled garbage collection (GC) mechanism is
      implemented.
      
      Issues with the current approach:
      
      1. Users may accidentally delete a model (`ollama rm llama3.3` instead
         of `ollama rm llama3.2`), requiring a full re-download unless another
         model references the same blobs.
      
      2. Users may assume a deleted model is still referenced elsewhere, but
         due to prior updates or deletions, the references no longer exist,
         leading to unnecessary re-downloads.
      
      Soon, we should implement a structured GC mechanism to retain
      unreferenced blobs for a configurable period before removal, which will
      run on "ollama rm" and other commands we deem appropriate.
      
      Users that want to immediately remove unreferenced blobs can use a new
      prune command that will allow them to specify the age and class of blobs
      to remove.
      
      Example usage:
      
          # Run basic blob GC
          $ ollama prune

          # Remove unreferenced blobs older than 7 days
          $ ollama prune --age 7d

          # Remove all blobs, referenced or not, older than 7 days (and their manifests?)
          $ ollama prune --age 7d --all

          # Remove all unreferenced blobs immediately
          $ ollama prune --age 0

          # Remove all blobs
          $ ollama prune --age 0 --all
      
      This should provide a safer and more predictable cleanup process.
    • server/.../backoff,syncs: don't break builds without synctest (#9484) · 55ab9f37
      Blake Mizerany authored
      Previously, developers without the synctest experiment enabled would see
      build failures when running tests in some server/internal/internal
      packages that use the synctest package. This change eases the transition
      to the package by guarding all use of synctest with build tags.
      
      synctest is enabled in CI. If a new change will break a synctest
      package, it will break in CI, even if it does not break locally.
      
      The developer docs have been updated to help with any confusion about
      why package tests pass locally but fail in CI.
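The build-tag guard looks roughly like this (the package and test names are illustrative). Files carrying this constraint compile only when `GOEXPERIMENT=synctest` is set, which is why such tests can pass locally yet still run, and potentially fail, in CI.

```go
//go:build goexperiment.synctest

package backoff

import (
	"testing"
	"testing/synctest"
)

// This file is excluded from builds without the experiment, so
// developers without it still get clean builds; CI enables the
// experiment, so regressions still surface there.
func TestBackoffTiming(t *testing.T) {
	synctest.Run(func() {
		// Time-dependent assertions run inside the synthetic-time bubble.
	})
}
```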
    • KindBrave · fefbf8f7
  4. 03 Mar, 2025 9 commits
  5. 02 Mar, 2025 7 commits
    • server/internal/client/ollama: handle extended names in client/ollama (#9454) · ee048b76
      Blake Mizerany authored
      The extended name format is a superset of the name format that only the
      client needs to know about, not the server or other dependents of the
      name package, so move the split logic into the client package.
      
      Also, take advantage of knowing about the extended name format to let
      the client use it when unlinking, verifying that it is unlinking the
      manifest with the content it intends.
    • Soulter · af68d60a
    • ml: Enable support for flash attention · 21aa666a
      Jesse Gross authored
      The GGML flash attention kernel has specific requirements for
      padding and permutation. This adds support to the KV cache
      for conforming to these requirements so that flash attention
      can be enabled.
      
      Flash attention can be used in the same situations as the llama
      engine and is enabled by the user in the same way.
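As one concrete example of such a requirement, flash attention kernels expect the KV cache length rounded up to an alignment boundary. The rounding helper below is a sketch; the multiple (256) is illustrative, not necessarily what the GGML kernel requires.

```go
package main

import "fmt"

// padTo rounds n up to the next multiple of m, e.g. to size a KV
// cache for a kernel with alignment requirements.
func padTo(n, m int) int {
	return (n + m - 1) / m * m
}

func main() {
	fmt.Println(padTo(1000, 256)) // 1024
}
```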
    • ml: Empty tensor constructor for tensors · ee141cc8
      Jesse Gross authored
      In cases where we allocate a tensor and then fully overwrite it with
      copied data, it is wasteful to first zero out the memory.
    • ggml-backend: Store parent backend as part of tensor · 55e5776c
      Jesse Gross authored
      It can be important for a tensor to know what backend it came from -
      for example, to know if flash attention is enabled.
    • attention: Remove unnecessary contiguous operations · 854a9195
      Jesse Gross authored
      Prior to performing attention, we need to permute query, key
      and value. Currently we call Contiguous after each of these
      permutations, which is correct but expensive. Avoiding the
      3 calls to Contiguous increases performance by over 20%.
      
      The permutations of query and key do not violate the contiguity
      requirements for mulmat, so the Contiguous calls can simply be removed.
      
      Value requires a different permutation and does require Contiguous.
      However, we can use the copy into the cache as a way to perform this
      without further overhead.
      
      To support this and avoid unexpected tensor shapes that are seen by
      models, we need tighter integration between attention, cache and
      backend. Future optimizations will also likely need this structure -
      for example, flash attention has special padding requirements in the
      cache, and other backends may have their own needs.
      
      This further contains the operations that go into attention so that
      these and other optimizations can be handled transparently. Models
      that have special requirements for attention can still implement
      their own version of it.
    • Jeffrey Morgan
  6. 01 Mar, 2025 3 commits
    • Jeffrey Morgan · e75c6126
    • server/internal/internal/names: validate names (#9400) · cda6f5c6
      Blake Mizerany authored
      This commit is a step towards a goal to make names less ceremonial
      outside of the registry client. Clients of the registry package can
      treat names as opaque strings, and the registry package will handle
      parsing, validating, and normalizing names.
      
      Ideally we end up with the names package tucked away in an internal
      package for good. We'll see how things go.
      
      Also, this package name is not permanent. This is another step in the
      ongoing process of refactoring the server code, and at some point it
      will most likely be renamed/moved.
    • server: validate local path on safetensor create (#9379) · bebb6823
      Bruce MacDonald authored
      More validation during the safetensor creation process:
      - Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
      - Add comprehensive test coverage for various paths
      - No functionality changes for valid inputs - existing workflows remain unaffected
      - Leverages Go 1.24's new os.Root functionality for secure containment
  7. 28 Feb, 2025 5 commits