- 03 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open source models. This updates the forward pass of llama architecture models to support both llama models and mistral models by accounting for additional metadata present in mistral models, and finding the correct dimensions for the output projection.
-
Michael Yang authored
-
- 27 Mar, 2025 2 commits
-
-
Jesse Gross authored
Model implementations should use Input for all of the tensors they supply to the model. This includes tensors that relate to the outputs, which is confusing since there is also an Output function. Since Output is only used internally in GGML and not by any model implementations, we can remove it from the interface to reduce confusion.
-
saman-amd authored
-
- 21 Mar, 2025 2 commits
-
-
Michael Yang authored
-
Jesse Gross authored
This enables the runner to report progress back to the Ollama server, both to show status to the user and to prevent the server from killing the runner if it thinks things have stalled. Most of the infrastructure was already there; this extends it to be available to the backends.
-
- 18 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
When converting a ggml model, if reading tensor data failed, a nil error value was being returned. The returned error should instead be the actual error from the read.
-
- 17 Mar, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 13 Mar, 2025 1 commit
-
-
shane.xb.qian authored
* macOS has a different definition, per info from @mxyng
-
- 12 Mar, 2025 1 commit
-
-
shane.xb.qian authored
Signed-off-by: shane.xb.qian <shane.qian@foxmail.com>
-
- 11 Mar, 2025 8 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Patrick Devine authored
-
- 10 Mar, 2025 1 commit
-
-
Michael Yang authored
this produces nicer output since both positive and negative values produce the same width
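One way to get equal-width output for positive and negative numbers in Go is the space flag of the `fmt` verbs, which reserves a column for the sign. The sketch below assumes that approach; the commit itself may pad values differently, and `formatValue` is a hypothetical name.

```go
package main

import "fmt"

// formatValue is a hypothetical helper. The space flag in "% .2f" leaves a
// blank where the minus sign would go, so both signs print the same width.
func formatValue(v float64) string {
	return fmt.Sprintf("% .2f", v)
}

func main() {
	for _, v := range []float64{1.5, -1.5, 0.25, -0.25} {
		fmt.Printf("[%s]\n", formatValue(v))
	}
}
```

With this formatting, columns of mixed-sign values line up without any manual padding.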
-
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
-
Jesse Gross authored
Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
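Meeting an alignment requirement usually means rounding a requested size up to the next multiple of the alignment. The following is a minimal sketch of that arithmetic; `alignUp` is a hypothetical helper and the alignment value is an assumption, not something taken from the commit.

```go
package main

import "fmt"

// alignUp rounds size up to the next multiple of align.
// An align of 0 is treated as "no requirement" and returns size unchanged.
func alignUp(size, align uint64) uint64 {
	if align == 0 {
		return size
	}
	return (size + align - 1) / align * align
}

func main() {
	const align = 32 // hypothetical backend alignment in bytes
	for _, size := range []uint64{0, 1, 32, 33, 100} {
		fmt.Printf("size=%d aligned=%d\n", size, alignUp(size, align))
	}
}
```

The division-based form works for any nonzero alignment, not only powers of two.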
-
- 07 Mar, 2025 13 commits
-
-
Jesse Gross authored
-
Michael Yang authored
this ensures the tensor is created on the right buffer type for backends such as cpu
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
temporary until tensor loading can accurately account for vision models
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
some tensors should be created on specific backends to reduce number of copies and improve performance
-
Michael Yang authored
each cache layer creates and maintains its own context instead of using a large context for all layers
-
Michael Yang authored
some tensors are expected to be used in repeating layers but are not themselves repeated. this change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends
-
Michael Yang authored
use a similar strategy as llama.cpp for deciding where tensors should be allocated. this will be improved later to be aware of usable memory before assigning the tensor
-
Jeffrey Morgan authored
-
- 04 Mar, 2025 1 commit
-
-
Michael Yang authored
- output backend system info when initializing the backend. this ensures this information is always present without needing to be called explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
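As a rough illustration of tracking device indices grouped by device name and reporting them with structured logging, the sketch below assigns each device a per-name index in enumeration order. The `device` type, `indexByName` function, and device names are all hypothetical; the real backend obtains its device list from GGML.

```go
package main

import "log/slog"

// device is a hypothetical record pairing a device name with its index
// among devices that share that name.
type device struct {
	Name  string
	Index int
}

// indexByName walks devices in enumeration order and numbers each one
// within its own name group, starting from zero.
func indexByName(names []string) []device {
	counts := make(map[string]int)
	out := make([]device, 0, len(names))
	for _, name := range names {
		out = append(out, device{Name: name, Index: counts[name]})
		counts[name]++
	}
	return out
}

func main() {
	// Hypothetical device list used only for illustration.
	for _, d := range indexByName([]string{"CPU", "CUDA", "CUDA"}) {
		slog.Info("device", "name", d.Name, "index", d.Index)
	}
}
```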
-
- 03 Mar, 2025 1 commit
-
-
Michael Yang authored
expand backend loading error handling to catch more problems and log them instead of panicking
-
- 02 Mar, 2025 3 commits
-
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
-
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
-
Jesse Gross authored
It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.
-