1. 11 Dec, 2025 1 commit
  2. 08 Dec, 2025 1 commit
    • refactor rope · 603ceefa
      Michael Yang authored
      change to a flatter directory structure and group the options with the
      function
      
      update models to call rope in one place
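      
      A minimal sketch of the pattern this describes, with illustrative names rather than the repository's actual API: the RoPE options live next to the function that consumes them, and models configure them through functional options at a single call site.
      
      ```go
      package rope
      
      // Sketch of "group the options with the function": the option struct and
      // functional options are declared beside Apply, so models call RoPE in
      // exactly one place. Names here are illustrative, not the repo's API.
      
      type Tensor interface{} // stand-in for the backend tensor type
      
      // Options collects everything RoPE needs; defaults are filled in by Apply.
      type Options struct {
        Dim     int
        Base    float32
        Scale   float32
        Factors Tensor
      }
      
      // Option overrides a single field, so models only state what differs.
      type Option func(*Options)
      
      func WithBase(base float32) Option   { return func(o *Options) { o.Base = base } }
      func WithScale(scale float32) Option { return func(o *Options) { o.Scale = scale } }
      func WithFactors(f Tensor) Option    { return func(o *Options) { o.Factors = f } }
      
      // Apply is the single call site through which models rotate query/key tensors.
      func Apply(t, positions Tensor, dim int, opts ...Option) Tensor {
        o := Options{Dim: dim, Base: 10000, Scale: 1}
        for _, opt := range opts {
          opt(&o)
        }
        // ... dispatch to the backend's RoPE kernel using o and positions ...
        return t
      }
      ```
      
      A model would then call something like rope.Apply(query, positions, headDim, rope.WithBase(500000)) instead of carrying its own option plumbing.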
  3. 02 Dec, 2025 1 commit
  4. 17 Sep, 2025 1 commit
  5. 16 Sep, 2025 1 commit
  6. 20 May, 2025 1 commit
  7. 19 May, 2025 1 commit
  8. 15 May, 2025 1 commit
    • ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding in multiple
      batches. This can arise if the embedding is larger than the batch size.
      In these cases (as with llama4), we would like to create views that
      are more appropriately sized. However, if we do this then the original
      source tensor is used in multiple graphs, which isn't allowed. To
      avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views. There is no
      longer a single vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
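      
      A rough sketch of the runner-side pattern described above (types and method names are stand-ins, not the engine's actual API): the vision graph runs once, its embedding is cached, and each text batch only receives an appropriately sized view.
      
      ```go
      package runner
      
      // Stand-in for the engine's tensor type (sketch only).
      type Tensor interface {
        View(offset, length int) Tensor
      }
      
      // multimodalEntry is owned by the runner: the vision graph that produces
      // the image embedding runs once, on first use, and later text batches
      // only receive views sized to whatever fits in the batch.
      type multimodalEntry struct {
        compute   func() Tensor // runs the vision graph
        embedding Tensor        // cached result, nil until first use
      }
      
      // viewFor returns the slice of the embedding covering input positions
      // [start, start+n), computing the full embedding lazily the first time.
      func (m *multimodalEntry) viewFor(start, n int) Tensor {
        if m.embedding == nil {
          m.embedding = m.compute()
        }
        return m.embedding.View(start, n)
      }
      ```
      
      Because the runner owns the entry rather than an opaque tensor, it can also account for the embedding's memory, which matters for memory management.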
  9. 13 May, 2025 1 commit
  10. 25 Apr, 2025 1 commit
  11. 03 Apr, 2025 2 commits
  12. 20 Mar, 2025 2 commits
    • model: Pass input tensor instead of raw data to models · 0fbfcf3c
      Jesse Gross authored
      Rather than directly giving the input data to models, we can
      pass a tensor instead. In the short term, this saves some duplicated
      code.
      
      Longer term, we will want to overlap setting up the next batch with
      processing of the current one. In this case, we will only have the
      shape of the tensor, but it will not be loaded with data at the time of
      graph generation. By passing only a tensor to models now, we set up
      this possibility and prevent them from relying on data that they won't
      have in the future.
      
      Although the same could be done for Positions and Outputs, in some
      cases we either need the raw input data or don't use them at all.
      Therefore, for now we leave them as they are and allow models to
      convert them to tensors as needed.
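      
      A minimal sketch of the resulting shape of the interface (names are illustrative, not the real model package definitions): the runner hands models a batch whose token IDs are already a tensor, while positions and outputs remain raw slices.
      
      ```go
      package model
      
      // Illustrative stand-ins for the engine's context and tensor types.
      type (
        Context interface{}
        Tensor  interface{}
      )
      
      // Batch sketches what the runner hands to models: the token IDs are
      // already a tensor (which may carry only a shape at graph-build time),
      // while Positions and Outputs stay as raw slices for now.
      type Batch struct {
        Inputs    Tensor
        Positions []int32
        Outputs   []int32
      }
      
      // Forward builds the graph from the batch. It must not assume
      // batch.Inputs holds data yet, only that its shape is known.
      func Forward(ctx Context, batch Batch) Tensor {
        // embed batch.Inputs, run the layers, return the logits ...
        return batch.Inputs
      }
      ```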
    • input: Rename Options to Batch · 0c220935
      Jesse Gross authored
      Options is no longer very descriptive of this struct.
  13. 19 Mar, 2025 1 commit
  14. 11 Mar, 2025 1 commit
  15. 10 Mar, 2025 1 commit
    • model: Update encoder cache to use multimodal input processing handler · a1cda80b
      Jesse Gross authored
      The encoder cache needs to know the position of images in the input
      stream so that it knows when to delete them. Previously images didn't
      have a position, so we implied one by breaking batches before an
      image and then assuming the image was in the first position. However,
      multimodal objects are now given explicit positions in the input
      stream, so we can use that instead.
      
      Breaking batches was also a way to simulate a cross attention mask
      for mllama. However, given that it only supports a single sequence
      and a single image, this mask doesn't serve any real purpose.
      Removing the batch break does not appear to affect the quality of
      the output.
      
      Most of this is simply moving the input data structures to a new
      package to avoid import cycles.
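      
      A simplified sketch of what explicit positions enable, using hypothetical names rather than the real encoder cache implementation: entries are stored under the position their image occupies in the input stream and evicted once every in-flight position has moved past them.
      
      ```go
      package encodercache
      
      // entry pairs an encoder output with the explicit position its image
      // occupies in the input stream (illustrative sketch, not the repo's API).
      type entry struct {
        position  int32
        embedding any
      }
      
      type Cache struct{ entries []entry }
      
      // Add stores an encoder output at the position it was assigned in the
      // input stream, instead of assuming "first position in the batch".
      func (c *Cache) Add(position int32, embedding any) {
        c.entries = append(c.entries, entry{position, embedding})
      }
      
      // Evict drops embeddings whose positions have already been consumed,
      // i.e. everything before the earliest position still being processed.
      func (c *Cache) Evict(oldestInFlight int32) {
        kept := c.entries[:0]
        for _, e := range c.entries {
          if e.position >= oldestInFlight {
            kept = append(kept, e)
          }
        }
        c.entries = kept
      }
      ```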
  16. 07 Mar, 2025 2 commits
  17. 04 Mar, 2025 1 commit
    • New engine: vision models and auto-fallback (#9113) · 1fdb351c
      Daniel Hiltgen authored
      * Include unified vision layers in memory prediction
      
      For newer vision models with a single gguf, include
      the projection estimates.
      
      * Adjust CLI to handle both styles of vision model metadata
      
      * Wire up new tokenizers for new engine
      
      If we're loading the new engine, utilize the new model
      text processor instead of calling into cgo wrappers for
      llama.cpp. This also cleans up some tech debt from the
      older tokenization flow for the C++ server, which was
      no longer used.
      
      This also adjusts the grammar handling logic to pass
      through to the new engine instead of utilizing the cgo
      schema to grammar call.
      
      * Lay foundation for auto selection of new engine
  18. 02 Mar, 2025 1 commit
    • attention: Remove unnecessary contiguous operations · 854a9195
      Jesse Gross authored
      Prior to performing attention, we need to permute query, key
      and value. Currently we call Contiguous after each of these
      permutations, which is correct but expensive. Avoiding the
      3 calls to Contiguous increases performance by over 20%.
      
      The permutations of query and key do not violate the continuity
      rules for mulmat and the Contiguous call can be simply removed.
      
      Value requires a different permutation and does require Contiguous.
      However, we can use the copy into the cache as a way to perform this
      without further overhead.
      
      To support this and avoid unexpected tensor shapes that are seen by
      models, we need tighter integration between attention, cache
      and backend. Future optimizations will also likely need this structure;
      for example, flash attention has special padding requirements in
      the cache, and other backends may have their own needs.
      
      This further contains the operations that go into attention so that
      these and other optimizations can be handled transparently. Models
      that have special requirements for attention can still implement
      their own version of it.
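      
      A before/after sketch of the attention prologue, using stand-in interfaces rather than the engine's real ml and kvcache types:
      
      ```go
      package attention
      
      // Minimal stand-ins so the sketch is self-contained; the engine's real
      // context, tensor, and cache interfaces differ.
      type Context interface{}
      
      type Tensor interface {
        Permute(ctx Context, dims ...int) Tensor
        MulMat(ctx Context, t2 Tensor) Tensor
        Add(ctx Context, t2 Tensor) Tensor
      }
      
      type Cache interface {
        Put(ctx Context, key, value Tensor)
        Get(ctx Context) (key, value, mask Tensor)
      }
      
      // Before this change every Permute below was followed by a Contiguous
      // call. Q and K can stay in their permuted (non-contiguous) layout,
      // which the matmul accepts; V becomes contiguous as a side effect of
      // being copied into the KV cache, so no extra Contiguous op remains.
      func Attention(ctx Context, query, key, value Tensor, cache Cache) Tensor {
        query = query.Permute(ctx, 0, 2, 1, 3) // no Contiguous needed
        key = key.Permute(ctx, 0, 2, 1, 3)     // no Contiguous needed
      
        cache.Put(ctx, key, value) // the cache copy lays value out contiguously
        key, value, mask := cache.Get(ctx)
      
        scores := key.MulMat(ctx, query) // (scaling and softmax omitted)
        scores = scores.Add(ctx, mask)
        return value.MulMat(ctx, scores)
      }
      ```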
  19. 27 Feb, 2025 1 commit
  20. 21 Feb, 2025 1 commit
    • ml: Abstract attention out of model definitions · f53f4198
      Jesse Gross authored
      
      
      There are two benefits to doing this:
       - Provide a library function that models can use, reducing code for
         each model implementation
       - Enable a single place to drop in optimized implementations of
         attention based on the backend or other factors. One is provided for
         GGML.
      
      On CUDA this improves token generation rate by about 3%. It does not
      have a significant effect on Metal.
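      
      A condensed sketch of what such a library function can look like (the signature and the fast-path check are illustrative, not the exact definition): models call one helper, and a backend that exposes an optimized scaled-dot-product-attention implementation is picked up automatically.
      
      ```go
      package nn
      
      // Stand-in types; the engine's real interfaces live elsewhere.
      type Context interface{}
      
      type Tensor interface {
        MulMat(ctx Context, t2 Tensor) Tensor
        Scale(ctx Context, s float64) Tensor
        Softmax(ctx Context) Tensor
      }
      
      // ScaledDotProductAttention is an optional fast path a backend can expose.
      type ScaledDotProductAttention interface {
        Attention(ctx Context, q, k, v Tensor, scale float64) Tensor
      }
      
      // Attention is the shared helper models call instead of spelling out
      // attention themselves. If the backend provides an optimized
      // implementation it is used; otherwise the generic graph is built
      // here, in one place. (Mask handling omitted for brevity.)
      func Attention(ctx Context, q, k, v Tensor, scale float64) Tensor {
        if sdpa, ok := q.(ScaledDotProductAttention); ok {
          return sdpa.Attention(ctx, q, k, v, scale)
        }
        scores := k.MulMat(ctx, q).Scale(ctx, scale).Softmax(ctx)
        return v.MulMat(ctx, scores)
      }
      ```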
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
  21. 20 Feb, 2025 1 commit
    • models: Prune unused outputs earlier in the forward pass · 5c5535c0
      Jesse Gross authored
      Currently Rows is called as the last step in a model computation
      to get the values for the output tokens. However, if we move it
      earlier in the process then we can trim out computations that
      never get used. This is similar to how models are defined in
      llama.cpp.
      
      Changing the model definition in this way improves token generation
      performance by approximately 8%.
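      
      A minimal sketch of the reordering (names are illustrative): Rows is applied right after the transformer layers, so the final norm and output projection only run over the tokens whose logits were requested.
      
      ```go
      package model
      
      // Stand-ins for the engine's context and tensor types (sketch only).
      type Context interface{}
      
      type Tensor interface {
        Rows(ctx Context, indices Tensor) Tensor
        MulMat(ctx Context, t2 Tensor) Tensor
      }
      
      // forward sketches the reordering: instead of computing the final norm
      // and output projection over every position and selecting the output
      // rows at the very end, the rows are gathered as soon as the layers
      // finish, so the remaining work covers only the requested tokens.
      func forward(ctx Context, hidden, outputIndices, outputWeight Tensor) Tensor {
        // ... transformer layers over all positions ...
      
        hidden = hidden.Rows(ctx, outputIndices) // prune unused positions early
        // ... final norm over the pruned rows ...
        return outputWeight.MulMat(ctx, hidden) // logits for output tokens only
      }
      ```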
  22. 14 Feb, 2025 6 commits
    • Runner for Ollama engine · ed443a03
      Jesse Gross authored
      This provides integration with the new Ollama engine
      (58245413 next ollama runner (#7913)) and the rest of the Ollama
      infrastructure such as the runner and Ollama server.
      
      In addition, it also builds out the KV cache infrastructure to
      support requirements of how Ollama runs models such as:
       - Parallel processing
       - Memory management for defragmentation and shifting
       - Multi-modal models
      
      Both old and new engines continue to be supported. By default, only
      the old engine is used. To enable the new engine:
      
      Start the server with the OLLAMA_NEW_ENGINE environment variable set:
      OLLAMA_NEW_ENGINE=1 ./ollama serve
      
      Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
      ./ollama run jessegross/llama3.1
    • models: Move model into their own directory · 6945617a
      Jesse Gross authored
      This allows the list of models to live in its own file rather than
      being mixed into the runner code.
    • vocab: Use int32 for special tokens · 7916f550
      Jesse Gross authored
      Special tokens are currently read as uint32 from the model metadata.
      However, all other parts of the system (including the tokenizer) use
      int32 to represent tokens so it is impossible to represent the high
      portion of the unsigned range. For consistency and to avoid casts,
      we should just use int32 everywhere.
    • backend: API to support full precision matmul · d773b7d6
      Jesse Gross authored
      Most tensor backends try to optimize performance by using a lower
      precision for matmuls. However, some operations (such as kq) on
      some models are sensitive to this and require full precision.
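      
      A sketch of what such an API can look like, with illustrative method names rather than the exact ones in the ml package: the default matmul may trade precision for speed, and a separate entry point forces full precision for sensitive products such as K*Q.
      
      ```go
      package ml
      
      type Context interface{}
      
      type Tensor interface {
        // Mulmat may use a lower internal precision if the backend chooses.
        Mulmat(ctx Context, t2 Tensor) Tensor
        // MulmatFullPrec requests full-precision accumulation.
        MulmatFullPrec(ctx Context, t2 Tensor) Tensor
      }
      
      // attentionScores shows the intended use: models that need it opt in
      // to full precision only for the kq product.
      func attentionScores(ctx Context, k, q Tensor) Tensor {
        return k.MulmatFullPrec(ctx, q)
      }
      ```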
    • backend: Consistently use int (vs. int64) for tensor shapes · 0e38297f
      Jesse Gross authored
      Currently there is a mixture of int and int64 used when dealing with
      tensor dimensions and shapes, which causes unnecessary conversions -
      they all should be the same type.
      
      In general, most interfaces (such as PyTorch) use int64 for
      generality but most implementations (such as CUDA) use int32 for
      performance. There isn't much benefit to us to being more flexible
      than the implementations we are likely to run on.
      
      In addition, as a practical matter, a model with a tensor with a single
      dimension larger than 32 bits is unlikely to run on a 32-bit machine.
    • next ollama runner (#7913) · 58245413
      Michael Yang authored
      
      
      feat: add new Ollama engine using ggml through cgo
      
      This change introduces a new way to run pretrained models. It introduces three high-level interfaces and a number of smaller helper interfaces to facilitate this.
      
      - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
      - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
      - `ml.Tensor` defines the interface for a tensor and tensor operations
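      
      A condensed sketch of those three interfaces, with illustrative method sets rather than the exact definitions in `model/model.go` and `ml/backend.go`:
      
      ```go
      package engine
      
      // ml.Tensor (sketch): a tensor handle plus its operations.
      type Tensor interface {
        Shape() []int
        Mulmat(ctx Context, t2 Tensor) Tensor
        // ... Add, Softmax, RoPE, and so on ...
      }
      
      // Context is a per-graph compute context handed out by the backend.
      type Context interface{}
      
      // ml.Backend (sketch): loads a pretrained model onto the hardware and
      // exposes the loaded weights as tensors.
      type Backend interface {
        Get(name string) Tensor
        NewContext() Context
      }
      
      // Batch carries the inputs for one forward pass.
      type Batch struct {
        Inputs    Tensor
        Positions []int32
        Outputs   []int32
      }
      
      // model.Model (sketch): one architecture, e.g. llama or mllama, whose
      // Forward method builds the graph that produces completions.
      type Model interface {
        Forward(ctx Context, batch Batch) (Tensor, error)
      }
      ```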
      
      This is the first implementation of the new engine. Follow up PRs will implement more features:
      
      - non-greedy sampling (#8410)
      - integration with Ollama and KV caching (#8301)
      - more model support (#9080) with more coming soon
      Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>