Commits · 764e199d6703d80da4d245381efa4a3a412813b2 · OpenDAS / ollama

07 Mar, 2025 1 commit

kvcache: create cache ctx per layer · 764e199d

Michael Yang authored Feb 25, 2025

each cache layer creates and maintains its own context instead of using
a large context for all layers

764e199d

02 Mar, 2025 2 commits

ml: Empty tensor constructor for tensors · ee141cc8

Jesse Gross authored Feb 28, 2025

In cases where we allocate a tensor and then fully overwrite it with
copied data, it is wasteful to first zero out the memory.

ee141cc8

attention: Remove unnecessary contiguous operations · 854a9195

Jesse Gross authored Feb 22, 2025

Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.

Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.

To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
 - for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.

This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.

854a9195

27 Feb, 2025 1 commit

ml: update Context.Forward interface · 3e8b8a19

Michael Yang authored Feb 21, 2025

update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute

3e8b8a19

14 Feb, 2025 1 commit

Runner for Ollama engine · ed443a03

Jesse Gross authored Dec 17, 2024

This provides integration with the new Ollama engine
(58245413 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal modals

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1

ed443a03