- 11 Mar, 2025 3 commits
  - Michael Yang authored
  - Patrick Devine authored
  - Daniel Hiltgen authored
- 10 Mar, 2025 10 commits
  - Michael Yang authored
    fix: pad tensor item if ge zero
  - Michael Yang authored
    This produces a nicer output since positive and negative values produce the same width.
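    A minimal illustration (not the actual Ollama code) of the idea: Go's fmt space flag reserves a sign column for non-negative values, so positive and negative numbers print with the same width.

    ```go
    package main

    import "fmt"

    func main() {
        vals := []float32{1.2345, -0.5, 0.25, -3.75}
        for _, v := range vals {
            // "% .4f": the space flag pads non-negative values with a leading
            // space where the minus sign would go, aligning the columns.
            fmt.Printf("% .4f\n", v)
        }
    }
    ```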
  - Vincent Koc authored
  - Parth Sareen authored
  - Michael Yang authored
    Better WantedBy declaration
  - frob authored
  - Xiaowei Zhu authored
  - Sam authored
  - Jeffrey Morgan authored
  - Jesse Gross authored
    The encoder cache needs to know the position of images in the input stream so that it knows when to delete them. Previously images didn't have a position, so we implied one by breaking batches before an image and then assuming the image was in the first position. However, multimodal objects are now given explicit positions in the input stream, so we can use that instead.
    Breaking batches was also a way to simulate a cross-attention mask for mllama. However, given that it only supports a single sequence and a single image, this mask doesn't serve any real purpose. Removing the batch break does not appear to affect the quality of the output.
    Most of this is simply moving the input data structures to a new package to avoid import cycles.
- 09 Mar, 2025 1 commit
  - Jesse Gross authored
    It's OK to fail on startup, but we shouldn't panic during runtime based on user input. Downgrade the panic to a warning.
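    A rough sketch (illustrative names, not the actual change) of the pattern: validate user input at runtime with a warning and a safe fallback instead of panicking.

    ```go
    package main

    import "log/slog"

    type options struct{ NumCtx int }

    // clampContext rejects bad user input with a warning and a default value
    // rather than panicking. The field name and default are hypothetical.
    func clampContext(opts *options) {
        if opts.NumCtx <= 0 {
            slog.Warn("invalid context length, using default", "requested", opts.NumCtx)
            opts.NumCtx = 2048
        }
    }

    func main() {
        o := options{NumCtx: -1}
        clampContext(&o)
        slog.Info("context length", "value", o.NumCtx)
    }
    ```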
- 08 Mar, 2025 5 commits
  - Jesse Gross authored
    Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
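    A hedged sketch of the kind of guard this implies; the environment variable names follow Ollama's documented OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE, but the fallback logic here is an assumption, not the server's actual code.

    ```go
    package main

    import (
        "log/slog"
        "os"
        "strings"
    )

    // effectiveKVCacheType falls back to f16 when a quantized KV cache type
    // (e.g. q8_0, q4_0) is requested without flash attention enabled.
    func effectiveKVCacheType() string {
        kvType := os.Getenv("OLLAMA_KV_CACHE_TYPE")
        flashAttn := os.Getenv("OLLAMA_FLASH_ATTENTION") == "1"

        if strings.HasPrefix(kvType, "q") && !flashAttn {
            slog.Warn("quantized KV cache requires flash attention; falling back to f16", "requested", kvType)
            return "f16"
        }
        if kvType == "" {
            return "f16"
        }
        return kvType
    }

    func main() {
        slog.Info("kv cache type", "type", effectiveKVCacheType())
    }
    ```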
  - Jesse Gross authored
  - Jesse Gross authored
    Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
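    The usual fix is to round every requested buffer size up to a multiple of the backend's alignment. A minimal sketch (not the actual allocator code):

    ```go
    package main

    import "fmt"

    // alignUp rounds size up to the next multiple of align, which is assumed
    // to be a power of two, as backend alignments typically are.
    func alignUp(size, align uint64) uint64 {
        return (size + align - 1) &^ (align - 1)
    }

    func main() {
        // A 1000-byte request on a backend requiring 256-byte alignment.
        fmt.Println(alignUp(1000, 256)) // 1024
    }
    ```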
  - Jesse Gross authored
    Models can disable causality for all or part of their processing while continuing to store data in the KV cache.
  - Jesse Gross authored
    Debug logging of every token has previously caused test timeouts on slower machines.
- 07 Mar, 2025 19 commits
  - Jesse Gross authored
  - Michael Yang authored
    This ensures the tensor is created on the right buffer type for backends such as CPU.
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
    Temporary until tensor loading can accurately account for vision models.
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
    Some tensors should be created on specific backends to reduce the number of copies and improve performance.
  - Michael Yang authored
    Each cache layer creates and maintains its own context instead of using a large context for all layers.
  - Michael Yang authored
    Some tensors are expected to be used in repeating layers but are not themselves repeated. This change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends.
  - Michael Yang authored
    Use a strategy similar to llama.cpp's for deciding where tensors should be allocated. This will be improved later to be aware of usable memory before assigning the tensor.
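    A very rough sketch of that kind of placement strategy (hypothetical types and sizes, not Ollama's implementation): place each layer on the first backend with room for it and spill the rest to the CPU.

    ```go
    package main

    import "fmt"

    type backend struct {
        name string
        free uint64 // bytes assumed available
    }

    // assignLayers places each layer on the first backend that can hold it,
    // defaulting to the last backend (CPU). Real placement also accounts for
    // output tensors, caches, and per-tensor buffer types.
    func assignLayers(layerSize uint64, nLayers int, backends []backend) []string {
        placement := make([]string, nLayers)
        for i := range placement {
            placement[i] = backends[len(backends)-1].name
            for j := range backends {
                if backends[j].free >= layerSize {
                    backends[j].free -= layerSize
                    placement[i] = backends[j].name
                    break
                }
            }
        }
        return placement
    }

    func main() {
        backends := []backend{{"gpu0", 6 << 30}, {"cpu", 1 << 40}}
        fmt.Println(assignLayers(512<<20, 16, backends))
    }
    ```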
  - Parth Sareen authored
    This change brings in various interface cleanups along with greatly improving the performance of the sampler. Tested with llama3.2 on a local machine, it improves performance from ~70 tokens/s to ~135 tokens/s with topK(40) enabled. Without topK, performance is ~110 tokens/s.
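    For context, a compact top-k sampling sketch (illustrative only; the real sampler avoids a full sort and is considerably more optimized):

    ```go
    package main

    import (
        "fmt"
        "math"
        "math/rand"
        "sort"
    )

    // sampleTopK keeps the k highest logits, applies softmax to them, and
    // samples one token index from the resulting distribution.
    func sampleTopK(logits []float64, k int, rng *rand.Rand) int {
        idx := make([]int, len(logits))
        for i := range idx {
            idx[i] = i
        }
        sort.Slice(idx, func(a, b int) bool { return logits[idx[a]] > logits[idx[b]] })
        if k > len(idx) {
            k = len(idx)
        }
        top := idx[:k]

        // Softmax over the surviving logits, shifted by the max for stability.
        probs := make([]float64, k)
        var sum float64
        for i, id := range top {
            probs[i] = math.Exp(logits[id] - logits[top[0]])
            sum += probs[i]
        }

        r := rng.Float64() * sum
        for i, id := range top {
            r -= probs[i]
            if r <= 0 {
                return id
            }
        }
        return top[k-1]
    }

    func main() {
        rng := rand.New(rand.NewSource(1))
        fmt.Println(sampleTopK([]float64{0.1, 2.5, 1.0, -0.3, 3.2}, 3, rng))
    }
    ```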
  - Breaker authored
  - Jeffrey Morgan authored
  - rekcäH nitraM authored
    The problem with default.target is that it always points to the target that is currently started, so if you boot into single-user mode or rescue mode, Ollama still tries to start. I noticed this because it tried (and failed) to start all the time during a system update, where Ollama is definitely not wanted.
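    A typical fix is to install the unit under multi-user.target instead of default.target, roughly like this excerpt of an ollama.service [Install] section (a sketch; the exact shipped unit may differ):

    ```ini
    # ollama.service (excerpt)
    [Install]
    # default.target follows whatever target the system booted into, so the
    # service would also start in rescue or single-user mode.
    # multi-user.target limits it to normal multi-user boots.
    WantedBy=multi-user.target
    ```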
  - Jesse Gross authored
    Various vision models have different requirements for how they receive their inputs. For example:
    - Mllama wants images together with text, and the image embeddings don't themselves have positions or get stored in the main KV cache.
    - Llava-style models feed in embeddings similar to tokens, and images correspond to a varying number of tokens in the cache.
    In addition, the strategy for providing inputs must support batching and multiple sequences, which are managed by the runner. At the same time, we want to keep data handling fully in the model so that new architectures are not bottlenecked by runner code which does not understand their particular requirements.
    This provides a method for models to edit the input stream so that it meets their needs while still being in a format that the runner understands. This allows the runner to avoid special processing for different models. In addition, this fixes a regression where non-vision models may try to incorrectly interpret images.
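    A heavily simplified sketch of what such a hook can look like; the type and method names here are hypothetical illustrations, not the actual runner or model API.

    ```go
    package input

    // Input is one element of the stream fed to the model: a token and/or a
    // multimodal object such as an image embedding (hypothetical layout).
    type Input struct {
        Token      int32
        Multimodal any
    }

    // MultimodalProcessor is a hypothetical hook: a model that needs special
    // input handling rewrites the tokenized input stream into a form the
    // runner can batch and cache without any model-specific knowledge.
    type MultimodalProcessor interface {
        PostTokenize(inputs []Input) ([]Input, error)
    }
    ```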
  - Jesse Gross authored
    We sometimes tokenize partial strings. For example, with multimodal inputs, we split the input string around the images and then tokenize each piece. In these cases, we should only add the special tokens on the first piece.
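    A small sketch of the pattern (the tokenizer here is a stand-in, not the real one): when the input is split around images, only the first text piece receives the special tokens.

    ```go
    package main

    import (
        "fmt"
        "strings"
    )

    // tokenize is a stand-in tokenizer; addSpecial controls whether special
    // tokens such as <bos> are prepended.
    func tokenize(text string, addSpecial bool) []string {
        toks := strings.Fields(text)
        if addSpecial {
            toks = append([]string{"<bos>"}, toks...)
        }
        return toks
    }

    func main() {
        // Pieces of one prompt split around an image placeholder.
        pieces := []string{"Describe", "the scene in this image"}
        var all []string
        for i, p := range pieces {
            all = append(all, tokenize(p, i == 0)...)
        }
        fmt.Println(all) // [<bos> Describe the scene in this image]
    }
    ```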
- 05 Mar, 2025 2 commits
  - Blake Mizerany authored
    This commit replaces the old pull implementation in the server package with the new, faster, more robust pull implementation in the registry package. The new endpoint, and now the remove endpoint too, are behind the feature gate "client2", enabled only by setting the OLLAMA_EXPERIMENT environment variable to include "client2".
    Currently, the progress indication is wired to perform the same as the previous implementation to avoid making changes to the CLI, and because the status reports happen at the start of the download and at the end of the write to disk, the progress indication is not as smooth as it could be. This is a known issue and will be addressed in a future change.
    This implementation may be ~0.5-1.0% slower in rare cases, depending on network and disk speed, but is generally MUCH faster and more robust than its predecessor in all other cases.
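    A sketch of how such an environment-variable feature gate is commonly checked (illustrative; a comma-separated list is assumed and the server's actual parsing may differ):

    ```go
    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    // useClient2 reports whether the "client2" experiment is enabled via the
    // OLLAMA_EXPERIMENT environment variable.
    func useClient2() bool {
        for _, e := range strings.Split(os.Getenv("OLLAMA_EXPERIMENT"), ",") {
            if strings.TrimSpace(e) == "client2" {
                return true
            }
        }
        return false
    }

    func main() {
        fmt.Println("client2 enabled:", useClient2())
    }
    ```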
  - Daniel Hiltgen authored
    To stay under the 2 GB GitHub artifact limit, we're splitting ROCm out like we do on Linux.