- 11 Apr, 2025 3 commits
-
-
Jesse Gross authored
Allocating (and in particular, freeing) memory from CUDA host buffers is expensive and can cause a significant performance hit if we do it for every token. Using normal system memory avoids this issue and also gives the OS more flexibility to manage it. There is no performance impact from this patch directly (either positive or negative) but it makes a difference once we start freeing memory correctly.
-
Jesse Gross authored
Context is currently mixed between pointer and value receivers. Change this to be all pointer receivers so we don't have to reason about whether the things we are updating in the struct will be retained.
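A minimal Go sketch of the pitfall (the Context type here is illustrative, not the actual struct): with a value receiver, updates are applied to a copy and silently dropped, which is exactly what uniform pointer receivers avoid.

```go
package main

import "fmt"

// Context is an illustrative stand-in, not Ollama's actual struct; it only
// shows why mixed receivers make updates hard to reason about.
type Context struct{ nodes int }

// Value receiver: the method mutates a copy, so the update is lost.
func (c Context) addNodeByValue() { c.nodes++ }

// Pointer receiver: the method mutates the caller's struct, so the update is retained.
func (c *Context) addNodeByPointer() { c.nodes++ }

func main() {
	var c Context
	c.addNodeByValue()
	fmt.Println(c.nodes) // 0 - the increment was applied to a copy
	c.addNodeByPointer()
	fmt.Println(c.nodes) // 1 - the increment was retained
}
```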
-
Jesse Gross authored
Sometimes loading the GGUF file fails with "panic: context canceled". This is probably a filesystem error, but the message doesn't provide any information about what actually happened.
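A hedged sketch of the kind of error wrapping that helps here (package and function names are illustrative, not the actual loader): annotate the failure with the file, offset, and size so the eventual panic or log message carries useful context.

```go
package gguf

import (
	"fmt"
	"io"
	"os"
)

// readSection wraps any read error with the file path, offset, and size so a
// failure reports more than just "context canceled".
func readSection(path string, offset, size int64) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open %q: %w", path, err)
	}
	defer f.Close()

	buf := make([]byte, size)
	if _, err := io.ReadFull(io.NewSectionReader(f, offset, size), buf); err != nil {
		return nil, fmt.Errorf("read %q at offset %d (%d bytes): %w", path, offset, size, err)
	}
	return buf, nil
}
```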
-
- 08 Apr, 2025 2 commits
-
-
Jesse Gross authored
Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations - Ollama will crash in the middle of inference. If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future. Currently, this only generates a worst case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
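A rough Go sketch of the idea (the Model interface and function names are assumptions, not the runner's real API): drive one worst-case text batch through the model at startup so allocation failures surface immediately rather than mid-inference.

```go
package runner

import "fmt"

// Model is an illustrative stand-in for the runner's model interface.
type Model interface {
	Forward(tokens []int32, contextLen int) error
}

// reserveWorstCase runs one maximum-size text batch at startup so that cache
// and graph allocations either succeed or fail immediately. Vision inputs are
// not covered, matching the partial allocation noted above.
func reserveWorstCase(m Model, maxBatch, maxContext int) error {
	batch := make([]int32, maxBatch) // worst-case batch of text tokens
	if err := m.Forward(batch, maxContext); err != nil {
		return fmt.Errorf("reserving worst-case memory: %w", err)
	}
	return nil
}
```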
-
Jesse Gross authored
If there is a CUDA OOM, we currently don't check the return value and will eventually segfault. This checks for the problem and generates a Go error. At the moment, this will still result in a panic, but having the error is the first step to being able to handle it more gracefully.
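A hedged sketch of the check (the status type and allocBuffer call are placeholders, not the actual cgo bindings): inspect the allocator's return value and turn a failed allocation into a Go error instead of dereferencing a null buffer later.

```go
package backend

import "fmt"

// status models a C-style return code from the CUDA/GGML allocator.
type status int

const (
	statusOK status = iota
	statusAllocFailed
)

// allocBuffer stands in for the call into the backend allocator.
func allocBuffer(size uint64) (uintptr, status) {
	// real code would call into GGML/CUDA here
	return 0, statusAllocFailed
}

// newBuffer surfaces an out-of-memory condition as a Go error instead of
// handing back a null buffer that would segfault later.
func newBuffer(size uint64) (uintptr, error) {
	buf, st := allocBuffer(size)
	if st != statusOK || buf == 0 {
		return 0, fmt.Errorf("unable to allocate CUDA buffer of %d bytes", size)
	}
	return buf, nil
}
```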
-
- 05 Apr, 2025 1 commit
-
-
Daniel Hipke authored
Improves model loading times on network-based filesystems such as GCS FUSE by creating a dedicated file descriptor for each section of the file being read, reducing seeking.
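A hedged sketch of the approach (package and function names are illustrative): give each section of the file its own descriptor and a bounded SectionReader, so sequential reads of different sections don't turn into seeks on one shared file offset.

```go
package blobread

import (
	"io"
	"os"
)

// openSection returns an independent reader over [offset, offset+size) backed
// by its own file descriptor, so readers of different sections never force
// each other to seek.
func openSection(path string, offset, size int64) (io.ReadCloser, error) {
	f, err := os.Open(path) // dedicated descriptor for this section
	if err != nil {
		return nil, err
	}
	return struct {
		io.Reader
		io.Closer
	}{io.NewSectionReader(f, offset, size), f}, nil
}
```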
-
- 03 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open source models. This updates the forward pass of llama architecture models to support both Llama and Mistral models by accounting for additional metadata present in Mistral models and finding the correct dimensions for the output projection.
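A hedged sketch of the metadata handling (the parameters and defaulting rule are assumptions based on the description, not the exact implementation): when the model supplies an explicit head dimension, size the output projection from it; otherwise fall back to hidden size divided by head count, which recovers the original Llama behaviour.

```go
package llamamodel

// outputProjectionDim is illustrative: Mistral-style metadata may carry an
// explicit head dimension that differs from hiddenSize/numHeads, while older
// Llama metadata does not (headDim == 0 here means "not present").
func outputProjectionDim(hiddenSize, numHeads, headDim uint64) uint64 {
	if headDim == 0 { // metadata absent: fall back to the Llama convention
		headDim = hiddenSize / numHeads
	}
	// the attention output projection takes numHeads*headDim inputs and
	// produces hiddenSize outputs, so this is its input dimension
	return numHeads * headDim
}
```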
-
Michael Yang authored
-
- 27 Mar, 2025 1 commit
-
-
Jesse Gross authored
Model implementations should use Input for all of their tensors supplied to the model. This includes tensors that relate to the outputs, which is confusing since there is also an Output function. Since Output is only used internally in GGML and not used by any model implementations, we can remove it from the interface to reduce confusion.
-
- 21 Mar, 2025 1 commit
-
-
Michael Yang authored
-
- 18 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
When converting a GGML model, if there is a failure to read tensor data, a nil error value was being returned instead of the actual error from the read.
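A minimal sketch of the corrected pattern (names are illustrative, not the converter's actual code): return the error produced by the read rather than an unrelated nil-valued variable.

```go
package convert

import (
	"fmt"
	"io"
)

// readTensorData returns the error from the failed read. The bug described
// above returned a different, nil-valued error variable on this path, so
// callers saw success with missing data.
func readTensorData(r io.Reader, size int) ([]byte, error) {
	data := make([]byte, size)
	if _, err := io.ReadFull(r, data); err != nil {
		return nil, fmt.Errorf("reading tensor data: %w", err)
	}
	return data, nil
}
```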
-
- 17 Mar, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 11 Mar, 2025 8 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Patrick Devine authored
-
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
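A hedged sketch of the guard (the fallback behaviour and names are assumptions, not the server's exact logic): only honour a quantized KV cache type when flash attention is enabled.

```go
package server

import "log/slog"

// resolveKVCacheType is illustrative: quantized cache types (e.g. "q8_0")
// are only used when flash attention is on; otherwise fall back to f16.
func resolveKVCacheType(requested string, flashAttention bool) string {
	if requested == "" || requested == "f16" {
		return "f16"
	}
	if !flashAttention {
		slog.Warn("quantized KV cache requires flash attention; falling back to f16", "requested", requested)
		return "f16"
	}
	return requested
}
```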
-
Jesse Gross authored
Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
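A minimal sketch of meeting such a requirement: round each requested buffer size up to the backend's alignment before allocating.

```go
package backendmem

// alignUp rounds size up to the next multiple of align (align is assumed to
// be non-zero; backends typically report a power-of-two alignment).
func alignUp(size, align uint64) uint64 {
	return (size + align - 1) / align * align
}
```

For example, with a 256-byte alignment a request for 1000 bytes becomes a 1024-byte allocation.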
-
- 07 Mar, 2025 12 commits
-
-
Jesse Gross authored
-
Michael Yang authored
This ensures the tensor is created on the right buffer type for backends such as the CPU.
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Temporary until tensor loading can accurately account for vision models.
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Some tensors should be created on specific backends to reduce the number of copies and improve performance.
-
Michael Yang authored
Each cache layer creates and maintains its own context instead of using a single large context for all layers.
-
Michael Yang authored
Some tensors are expected to be used in repeating layers but are not themselves repeated. This change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends.
-
Michael Yang authored
Use a strategy similar to llama.cpp's for deciding where tensors should be allocated. This will be improved later to be aware of usable memory before assigning the tensor.
-
- 04 Mar, 2025 1 commit
-
-
Michael Yang authored
- output backend system info when initializing the backend. this ensures this information is always present without needing to be called explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
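A hedged sketch of what the enumeration and structured logging might look like (device fields and log keys are illustrative, not the actual output), using log/slog as described in the list above.

```go
package backendinfo

import "log/slog"

// device is an illustrative description of an enumerated backend device.
type device struct {
	name   string // e.g. "CUDA0" or "CPU"
	kind   string // e.g. "gpu", "cpu"
	memory uint64 // total memory in bytes
}

// logSystemInfo walks devices in order and emits one structured record per
// device, tracking a running index grouped by device name.
func logSystemInfo(devices []device) {
	indexByName := map[string]int{}
	for _, d := range devices {
		i := indexByName[d.name]
		indexByName[d.name] = i + 1
		slog.Info("system", "device", d.name, "index", i, "kind", d.kind, "memory", d.memory)
	}
}
```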
-
- 02 Mar, 2025 4 commits
-
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
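A hedged sketch of the padding part of this (the padding multiple is an assumption chosen for illustration, not taken from the commit): round the cache's effective KV length up to a fixed multiple before handing it to the flash attention kernel.

```go
package kvcache

// flashAttnPad is an assumed illustrative padding multiple; the real value is
// whatever the GGML flash attention kernel requires.
const flashAttnPad = 256

// paddedKVLen rounds the number of cached entries up so the kernel's shape
// requirements are met.
func paddedKVLen(n int) int {
	return (n + flashAttnPad - 1) / flashAttnPad * flashAttnPad
}
```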
-
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
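A hedged sketch of the API distinction this implies (method names and types here are illustrative placeholders): callers that will fully overwrite a tensor ask for an uninitialized allocation, while callers that rely on zeroed memory keep the zero-filled path.

```go
package ml

// DType and Tensor are placeholders standing in for the backend's real types.
type DType int
type Tensor interface{}

// Context is a trimmed, illustrative slice of the backend interface.
type Context interface {
	Zeros(dtype DType, shape ...int) Tensor // zero-filled; for callers that read before writing
	Empty(dtype DType, shape ...int) Tensor // uninitialized; caller promises to overwrite it fully
}
```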
-
Jesse Gross authored
It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.
-
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity rules for mulmat and the Contiguous call can simply be removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead.

To support this and avoid unexpected tensor shapes that are seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs.

This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
-