- 21 Mar, 2025 2 commits
-
-
Michael Yang authored
-
Jesse Gross authored
This enables the runner to report progress back to the Ollama server, both to show status to the user and to prevent the server from killing the runner if it thinks things have stalled. Most of the infrastructure was already there; this extends it to be available to the backends.
-
- 18 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
When converting a ggml model, if reading tensor data failed, a nil error value was returned. The returned value should be the actual error from the read.
-
- 17 Mar, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 13 Mar, 2025 1 commit
-
-
shane.xb.qian authored
* macOS has different definition per info from @mxyng
-
- 12 Mar, 2025 1 commit
-
-
shane.xb.qian authored
Signed-off-by: shane.xb.qian <shane.qian@foxmail.com>
-
- 11 Mar, 2025 8 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Patrick Devine authored
-
- 10 Mar, 2025 1 commit
-
-
Michael Yang authored
this produces nicer output since positive and negative values are printed with the same width
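One way to get equal widths in Go is the space flag of the `fmt` verbs, which reserves a column for the sign of non-negative numbers. The sketch below assumes that approach; the `formatValue` helper and the four-digit precision are illustrative and may differ from what the commit actually uses.

```go
package main

import "fmt"

// formatValue uses the space flag so a non-negative number gets a leading
// space in the column where a negative number has its minus sign.
func formatValue(v float64) string {
	return fmt.Sprintf("% .4f", v)
}

func main() {
	// Printed in brackets so the leading space is visible.
	for _, v := range []float64{1.5, -1.5, 0} {
		fmt.Printf("[%s]\n", formatValue(v))
	}
}
```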
-
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
-
Jesse Gross authored
Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
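Meeting an alignment requirement usually means rounding the requested size up to the next multiple of the alignment. A minimal sketch, with a hypothetical `alignUp` helper and an example alignment of 32 bytes (real backends report their own value):

```go
package main

import "fmt"

// alignUp rounds size up to the next multiple of alignment.
// alignment must be greater than zero.
func alignUp(size, alignment uint64) uint64 {
	return (size + alignment - 1) / alignment * alignment
}

func main() {
	const alignment = 32 // assumed value for illustration
	for _, size := range []uint64{0, 1, 100, 128} {
		fmt.Printf("requested %d bytes, allocating %d\n", size, alignUp(size, alignment))
	}
}
```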
-
- 07 Mar, 2025 13 commits
-
-
Jesse Gross authored
-
Michael Yang authored
this ensures the tensor is created on the right buffer type for backends such as cpu
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
temporary until tensor loading can accurately account for vision models
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
some tensors should be created on specific backends to reduce number of copies and improve performance
-
Michael Yang authored
each cache layer creates and maintains its own context instead of using a large context for all layers
-
Michael Yang authored
some tensors are expected to be used in repeating layers but are not themselves repeated. this change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends
-
Michael Yang authored
use a strategy similar to llama.cpp's for deciding where tensors should be allocated. this will be improved later to be aware of usable memory before assigning the tensor
-
Jeffrey Morgan authored
-
- 04 Mar, 2025 1 commit
-
-
Michael Yang authored
- output backend system info when initializing the backend. this ensures this information is always present without needing to be called explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
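The last two items can be pictured with a small sketch: walk the devices in enumeration order, give each one an index that counts up within its name group, and emit one structured log record per device. The `device` type, `indexByName` helper, and device names are hypothetical; only the `log/slog` calls are standard library API.

```go
package main

import (
	"log/slog"
	"os"
)

// device is a hypothetical stand-in for an enumerated backend device.
type device struct{ Name string }

// indexByName returns, for each device in enumeration order, its index
// within the group of devices that share the same name.
func indexByName(devs []device) []int {
	next := make(map[string]int)
	out := make([]int, len(devs))
	for i, d := range devs {
		out[i] = next[d.Name]
		next[d.Name]++
	}
	return out
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))

	devs := []device{{"CUDA"}, {"CUDA"}, {"CPU"}}
	for i, idx := range indexByName(devs) {
		// Key/value pairs instead of a formatted string keep the log structured.
		logger.Info("device", "name", devs[i].Name, "index", idx)
	}
}
```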
-
- 03 Mar, 2025 1 commit
-
-
Michael Yang authored
expand backend loading error handling to catch more problems and log them instead of panicking
-
- 02 Mar, 2025 4 commits
-
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
-
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
-
Jesse Gross authored
It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.
-
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the contiguity rules for mulmat, so the Contiguous call can simply be removed.

Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead.

To support this and avoid exposing unexpected tensor shapes to models, we need tighter integration between attention, cache and backend. Future optimizations will also likely need this structure. For example, flash attention has special padding requirements in the cache, and other backends may have their own needs.

This further encapsulates the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
-
- 27 Feb, 2025 3 commits
-
-
Michael Yang authored
update Context.Forward to accept multiple tensors to match Context.Compute signature
update Context.Forward to return Context such that it can be chained with Context.Compute
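The shape of this API change can be sketched as follows. The `Tensor` and `Context` types here are simplified, hypothetical stand-ins and not the real ml package definitions; the point is only the variadic parameter and the returned receiver that allows chaining.

```go
package main

import "fmt"

// Tensor is a hypothetical, simplified tensor handle.
type Tensor struct {
	Name     string
	Computed bool
}

// Context is a hypothetical, simplified compute context.
type Context struct {
	graph []*Tensor
}

// Forward accepts any number of tensors, adds them to the graph, and
// returns the context so a call to Compute can be chained onto it.
func (c *Context) Forward(tensors ...*Tensor) *Context {
	c.graph = append(c.graph, tensors...)
	return c
}

// Compute takes the same variadic argument list as Forward and marks the
// given tensors as computed (standing in for real graph execution).
func (c *Context) Compute(tensors ...*Tensor) {
	for _, t := range tensors {
		t.Computed = true
	}
}

func main() {
	a, b := &Tensor{Name: "a"}, &Tensor{Name: "b"}
	ctx := &Context{}

	// Both calls take the same tensors and can be written as one chain.
	ctx.Forward(a, b).Compute(a, b)

	fmt.Println(len(ctx.graph), a.Computed, b.Computed)
}
```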
-
Michael Yang authored
-
Michael Yang authored
-