- 11 Mar, 2025 7 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Patrick Devine authored
-
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
-
Jesse Gross authored
Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
-
- 07 Mar, 2025 12 commits
-
-
Jesse Gross authored
-
Michael Yang authored
this ensures the tensor is created on the right buffer type for backends such as cpu
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
temporary until tensor loading can accurately account for vision models
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
some tensors should be created on specific backends to reduce number of copies and improve performance
-
Michael Yang authored
each cache layer creates and maintains its own context instead of using a large context for all layers
-
Michael Yang authored
some tensors are expected to be used in repeating layers but are not themselves repeated. this change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends
-
Michael Yang authored
use a similar strategy as llama.cpp for deciding where tensors should be allocated. this will be improved later to be aware of usable memory before assigning the tensor
-
- 04 Mar, 2025 1 commit
-
-
Michael Yang authored
- output backend system info when initializing the backend. this ensures this information is always present without needing to be called explicitly - convert to structured logging - enumerate devices rather than backends since devices are ordered - track device indices grouped by device name
-
- 02 Mar, 2025 4 commits
-
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
-
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
-
Jesse Gross authored
It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.
-
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%. The permutations of query and key do not violate the continuity rules for mulmat and the Contiguous call can be simply removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead. To support this and avoid unexpected tensor shapes that are seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
-
- 27 Feb, 2025 1 commit
-
-
Michael Yang authored
update Context.Forward to accept multiple tensors to match Context.Compute signature update Context.Forward to return Context such that it can be chained with Context.Compute
-
- 21 Feb, 2025 2 commits
-
-
Jesse Gross authored
There are two benefits to doing this: - Provide a library function that models can use, reducing code for each model implementation - Enables a single place to drop in optimized implementations of attention based on the backend or other factors. One is provided for GGML. On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal. Co-authored-by:Daniel Hiltgen <daniel@ollama.com>
-
Michael Yang authored
-
- 20 Feb, 2025 2 commits
-
-
Jesse Gross authored
We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.
-
Jesse Gross authored
Currently the following parameters are in the runner but not used: - numGPULayers - mainGPU - threads - tensorSplit This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
-
- 14 Feb, 2025 9 commits
-
-
Daniel Hiltgen authored
-
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models such as: - Parallel processing - Memory management for defragmentation and shifting - Multi-modal modals Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine: Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1
-
Jesse Gross authored
-
Jesse Gross authored
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
-
Jesse Gross authored
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
-
Jesse Gross authored
Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.
-
Jesse Gross authored
There are two cases where we may not have an output after computing: - Prompt processing where the length of the input exceeds the batch size - Internal memory management operations such as cache defrag and shift
-
Jesse Gross authored
Currently there is a mixture of int and int64 used when dealing with tensor dimensions and shapes, which causes unnecessary conversions - they all should be the same type. In general, most interfaces (such as Pytorch) use int64 for generality but most implementations (such as CUDA) use int32 for performance. There isn't much benefit to us to being more flexible than the implementations we are likely to run on. In addition, as a practical matter, a model with a tensor with a single dimension larger than 32 bits is unlikely to run on a 32-bit machine.
-
Jesse Gross authored
It is not common to return errors with close/free operations - most people won't check it and even if they did there's probably not much that can do. It's better to not give implementations false expectations.
-