- 21 Mar, 2025 2 commits
Michael Yang authored
Jesse Gross authored
This enables the runner to report progress back to the Ollama server, both to show status to the user and to prevent the server from killing the runner if it thinks things have stalled. Most of the infrastructure was already there; this extends it to be available to the backends.
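A minimal sketch of the kind of hook this describes, with hypothetical names (this is not the runner's actual API): the backend invokes a callback as it loads tensors so the server can display status and reset its stall timer.

```go
package main

import "fmt"

// LoadProgressFn reports a fraction in [0, 1] as model weights are loaded.
type LoadProgressFn func(progress float32)

// loadModel simulates a backend load loop that reports progress after each
// tensor so the caller can surface it to the Ollama server.
func loadModel(numTensors int, report LoadProgressFn) {
	for i := 0; i < numTensors; i++ {
		// ... read tensor i and copy it to the device (elided) ...
		if report != nil {
			report(float32(i+1) / float32(numTensors))
		}
	}
}

func main() {
	loadModel(4, func(p float32) {
		fmt.Printf("loading: %.0f%%\n", p*100)
	})
}
```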
- 11 Mar, 2025 4 commits
Michael Yang authored
Michael Yang authored
Michael Yang authored
Patrick Devine authored
- 10 Mar, 2025 1 commit
Michael Yang authored
This produces nicer output since both positive and negative values produce the same width.
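One way to get this effect, as a small illustration rather than the commit's exact formatting code: printing an explicit sign makes positive and negative numbers occupy the same width, so mixed-sign columns stay aligned.

```go
package main

import "fmt"

func main() {
	// With an explicit sign, +42 and -42 render at the same width.
	fmt.Printf("[%+5d]\n", 42)  // [  +42]
	fmt.Printf("[%+5d]\n", -42) // [  -42]
}
```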
- 08 Mar, 2025 1 commit
Jesse Gross authored
As with the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
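A minimal sketch of the guard this implies; the function and fallback behavior here are illustrative, not the runner's actual code.

```go
package sketch

// kvCacheType returns the effective KV cache type: a quantized K/V type is
// only honored when flash attention is enabled, otherwise fall back to f16.
func kvCacheType(requested string, flashAttention bool) string {
	quantized := requested == "q8_0" || requested == "q4_0"
	if quantized && !flashAttention {
		return "f16" // quantized K/V requires the flash attention kernel
	}
	if requested == "" {
		return "f16"
	}
	return requested
}
```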
- 07 Mar, 2025 2 commits
Michael Yang authored
Some tensors should be created on specific backends to reduce the number of copies and improve performance.
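An illustrative sketch of the idea, not Ollama's real ml API: create a tensor on the backend that will consume it so the data does not need an extra host/device copy.

```go
package sketch

type Tensor interface{}

type Backend interface {
	Name() string
	NewTensor(shape ...int) Tensor
}

// newTensorOn picks the backend a tensor should live on: the layer's
// backend when that layer is offloaded, the CPU backend otherwise.
func newTensorOn(layerBackend, cpu Backend, offloaded bool, shape ...int) Tensor {
	if offloaded {
		return layerBackend.NewTensor(shape...)
	}
	return cpu.NewTensor(shape...)
}
```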
Michael Yang authored
Each cache layer creates and maintains its own context instead of using one large context for all layers.
- 04 Mar, 2025 1 commit
Michael Yang authored
- Output backend system info when initializing the backend. This ensures this information is always present without needing to be called explicitly.
- Convert to structured logging.
- Enumerate devices rather than backends, since devices are ordered.
- Track device indices grouped by device name.
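A minimal sketch of what structured, per-device logging of this kind could look like, using the standard library's log/slog; the field names and device struct are illustrative, not Ollama's actual code.

```go
package main

import "log/slog"

type deviceInfo struct {
	Name        string
	Description string
	FreeMemory  uint64
	TotalMemory uint64
}

// logDevices enumerates devices (which are ordered) and tracks an index per
// device name, then emits one structured log record per device.
func logDevices(devices []deviceInfo) {
	counts := map[string]int{}
	for _, d := range devices {
		idx := counts[d.Name]
		counts[d.Name]++
		slog.Info("system info",
			"device", d.Name,
			"index", idx,
			"description", d.Description,
			"free", d.FreeMemory,
			"total", d.TotalMemory,
		)
	}
}

func main() {
	logDevices([]deviceInfo{
		{Name: "CUDA", Description: "example GPU 0", FreeMemory: 20 << 30, TotalMemory: 24 << 30},
		{Name: "CUDA", Description: "example GPU 1", FreeMemory: 22 << 30, TotalMemory: 24 << 30},
	})
}
```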
- 02 Mar, 2025 3 commits
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
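A sketch of the distinction being drawn, with hypothetical names: a zeroing allocator versus one that skips the clear for callers that immediately overwrite the whole buffer. In a real backend the "empty" path maps to an allocation without a memset.

```go
package sketch

type Buffer struct{ data []float32 }

// Zeros allocates and clears n elements.
func Zeros(n int) *Buffer { return &Buffer{data: make([]float32, n)} }

// Empty allocates n elements without guaranteeing a clear. (Go's make still
// zeroes, so here it only documents intent.)
func Empty(n int) *Buffer { return &Buffer{data: make([]float32, n)} }

// FromSlice fully overwrites the buffer, so zeroing it first would be wasted work.
func FromSlice(src []float32) *Buffer {
	b := Empty(len(src))
	copy(b.data, src)
	return b
}
```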
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity rules for mulmat and the Contiguous call can simply be removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead.

To support this and avoid unexpected tensor shapes that are seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
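A conceptual sketch of the optimization; the interfaces are illustrative rather than the exact ml API. Query and key are permuted and fed to MulMat without Contiguous, while value's required layout change is folded into the copy that stores it in the cache.

```go
package sketch

type Tensor interface {
	Permute(order ...int) Tensor
	Contiguous() Tensor
	MulMat(other Tensor) Tensor
}

type Cache interface {
	// Put copies key and value into the cache; the copy also realizes
	// value's permutation, so no separate Contiguous call is needed.
	Put(key, value Tensor)
}

// scores builds the attention score matrix without any Contiguous calls.
func scores(query, key Tensor) Tensor {
	q := query.Permute(0, 2, 1, 3) // no Contiguous: layout is acceptable to MulMat
	k := key.Permute(0, 2, 1, 3)   // no Contiguous here either
	return k.MulMat(q)
}
```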
- 27 Feb, 2025 2 commits
Michael Yang authored
- Update Context.Forward to accept multiple tensors to match the Context.Compute signature.
- Update Context.Forward to return Context so that it can be chained with Context.Compute.
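A sketch of the chained style this enables; the signatures below are illustrative of the commit's description, not the exact interface definition.

```go
package sketch

type Tensor interface{}

type Context interface {
	// Forward takes multiple tensors, mirroring Compute, and returns the
	// Context so the two calls compose.
	Forward(tensors ...Tensor) Context
	Compute(tensors ...Tensor)
}

func run(ctx Context, hidden, logits Tensor) {
	ctx.Forward(hidden, logits).Compute(logits)
}
```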
Michael Yang authored
- 21 Feb, 2025 1 commit
Jesse Gross authored
There are two benefits to doing this:
- Provides a library function that models can use, reducing code for each model implementation.
- Enables a single place to drop in optimized implementations of attention based on the backend or other factors. One is provided for GGML.

On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
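An illustrative shape for such a shared helper (not the exact Ollama signature): models call a single Attention function built from generic tensor ops, and a backend such as GGML can substitute an optimized or fused implementation.

```go
package sketch

type Tensor interface {
	MulMat(t Tensor) Tensor
	Scale(s float64) Tensor
	Softmax() Tensor
}

// Attention computes softmax(k·q · scale) · v using generic tensor ops.
// A backend-specific version could replace this whole body with one kernel.
func Attention(query, key, value Tensor, scale float64) Tensor {
	kq := key.MulMat(query).Scale(scale).Softmax()
	return value.MulMat(kq)
}
```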
- 20 Feb, 2025 1 commit
Jesse Gross authored
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
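A sketch of carrying these runner parameters into the backend as a single params struct; the field and function names mirror the list above, but the exact types and constructor are illustrative.

```go
package sketch

type BackendParams struct {
	NumThreads   int       // threads
	NumGPULayers int       // numGPULayers
	MainGPU      int       // mainGPU
	TensorSplit  []float32 // tensorSplit
}

// New hands the params to the backend; per the commit, the GGML backend
// currently accepts them without acting on them.
func New(modelPath string, params BackendParams) error {
	_ = modelPath
	_ = params
	return nil
}
```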
- 14 Feb, 2025 7 commits
Daniel Hiltgen authored
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure, such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models, such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:
- Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve
- Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1
Jesse Gross authored
Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.
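A conceptual sketch of the distinction (the method name is illustrative): an opt-in full-precision matmul for numerically sensitive products such as k·q, alongside the backend's faster reduced-precision default.

```go
package sketch

type Tensor interface {
	// MulMat may run at reduced precision for speed.
	MulMat(t Tensor) Tensor
	// MulMatFullPrec forces full-precision accumulation for sensitive ops.
	MulMatFullPrec(t Tensor) Tensor
}
```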
Jesse Gross authored
There are two cases where we may not have an output after computing:
- Prompt processing where the length of the input exceeds the batch size
- Internal memory management operations such as cache defrag and shift
Jesse Gross authored
Currently there is a mixture of int and int64 used when dealing with tensor dimensions and shapes, which causes unnecessary conversions - they should all be the same type.

In general, most interfaces (such as PyTorch) use int64 for generality, but most implementations (such as CUDA) use int32 for performance. There isn't much benefit to us in being more flexible than the implementations we are likely to run on. In addition, as a practical matter, a model with a tensor with a single dimension larger than 32 bits is unlikely to run on a 32-bit machine.
Jesse Gross authored
It is not common to return errors from close/free operations - most people won't check them, and even if they did there's probably not much that can be done. It's better to not give implementations false expectations.
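The shape this leaves cleanup in, shown as an illustrative interface rather than the actual definition: Close frees resources and returns nothing, instead of an error the caller cannot act on.

```go
package sketch

type Context interface {
	Compute()
	Close() // previously: Close() error
}
```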
Michael Yang authored
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`.
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`.
- `ml.Tensor` defines the interface for a tensor and tensor operations.

This is the first implementation of the new engine. Follow-up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080), with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
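A condensed sketch of how the three interfaces relate; the real definitions in `model/model.go` and `ml/backend.go` carry more methods and different signatures than shown here.

```go
package sketch

type Tensor interface{} // ml.Tensor: a tensor plus its operations

// ml.Backend: loads a pretrained model onto hardware (GPU, CPU, etc.) and
// exposes its tensors and compute contexts.
type Backend interface {
	Get(name string) Tensor
	NewContext() Context
}

type Context interface {
	Compute(tensors ...Tensor)
	Close()
}

// model.Model: a model architecture; Forward runs forward propagation to
// produce output logits for a batch of inputs.
type Model interface {
	Forward(ctx Context, inputs []int32) (Tensor, error)
}
```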