- 02 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
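A minimal illustration of the equivalence; the two signatures below are identical to the compiler:

```go
package main

import "fmt"

// any is a predeclared alias for interface{} as of Go 1.18, so these two
// functions have exactly the same type.
func describeOld(v interface{}) string { return fmt.Sprintf("%T", v) }

func describeNew(v any) string { return fmt.Sprintf("%T", v) }

func main() {
	fmt.Println(describeOld(42), describeNew("hello")) // int string
}
```
-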
IsAurora6 authored
-
- 01 Apr, 2025 4 commits
-
-
Bruce MacDonald authored
With support for multimodal models becoming more varied and common, it is important for clients to be able to see what capabilities a model has. Returning these from the show endpoint allows clients to easily determine what a model can do.
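A sketch of how a client might consume this; the `capabilities` field name and shape are assumptions for illustration, not confirmed API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// showResponse models only the field this sketch cares about; the real
// show response carries many more fields.
type showResponse struct {
	Capabilities []string `json:"capabilities"` // assumed field name
}

func main() {
	body, _ := json.Marshal(map[string]string{"model": "llava"})
	resp, err := http.Post("http://localhost:11434/api/show", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var show showResponse
	if err := json.NewDecoder(resp.Body).Decode(&show); err != nil {
		panic(err)
	}
	fmt.Println("capabilities:", show.Capabilities) // e.g. [completion vision]
}
```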
-
Ilian authored
-
Abyss-c0re authored
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
-
湛露先生 authored
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
-
- 31 Mar, 2025 5 commits
-
-
Bruce MacDonald authored
Clear the KV cache when the shift operation is not supported by the model. Added a KvCacheCanShift() check to handle models that can't perform cache shifts, falling back to a full cache clear while preserving the logical token history to maintain expected behavior when the context window fills up.
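A minimal sketch of the fallback logic; the type and method names are illustrative, not the actual runner API:

```go
package sketch

// KvCache is a stand-in for the runner's cache (illustrative only).
type KvCache interface {
	CanShift() bool
	Shift(discard int) // drop the oldest `discard` tokens, keep the rest
	Clear()
}

// shiftOrClear shifts when the model supports it; otherwise it clears the
// cache entirely and relies on the retained logical token history being
// re-processed on the next forward pass.
func shiftOrClear(cache KvCache, discard int) {
	if cache.CanShift() {
		cache.Shift(discard)
		return
	}
	cache.Clear()
}
```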
-
Blake Mizerany authored
This change adds tracking of download chunks during the pull process so that subsequent pulls can skip downloading already completed chunks. This works across restarts of ollama. Currently, download state will be lost if a prune is triggered during a pull (e.g. restart or remove). This issue should be addressed in a follow-up PR.
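A sketch of the idea under stated assumptions; the real types and on-disk format differ:

```go
package sketch

// chunkState records one download chunk's completion so a restarted pull
// can skip it. Illustrative shape only.
type chunkState struct {
	Offset int64 `json:"offset"`
	Size   int64 `json:"size"`
	Done   bool  `json:"done"`
}

// remaining returns only the chunks still to be downloaded after loading
// the persisted state from a previous pull.
func remaining(chunks []chunkState) []chunkState {
	var todo []chunkState
	for _, c := range chunks {
		if !c.Done {
			todo = append(todo, c)
		}
	}
	return todo
}
```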
-
Jesse Gross authored
If we have an error after creating a new sequence but before finding a slot for it, we return without releasing the semaphore. This reduces our parallel sequences and eventually leads to deadlock. In practice this should never happen because once we have acquired the semaphore, we should always be able to find a slot. However, the code is clearly not correct.
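A sketch of the shape of the fix, using golang.org/x/sync/semaphore; the names other than the semaphore API are hypothetical:

```go
package sketch

import (
	"context"
	"errors"

	"golang.org/x/sync/semaphore"
)

// addSequence shows the invariant: once the semaphore is acquired, every
// error path must release it, or the parallel-sequence budget shrinks
// until the runner deadlocks.
func addSequence(ctx context.Context, sem *semaphore.Weighted) error {
	if err := sem.Acquire(ctx, 1); err != nil {
		return err
	}

	slot, err := findSlot()
	if err != nil {
		sem.Release(1) // the release this change adds
		return err
	}
	_ = slot
	return nil
}

func findSlot() (int, error) { return 0, errors.New("no slot available") }
```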
-
Jesse Gross authored
With the llama runner, we can generate up to NUM_PARALLEL batches at once, which will then get broken up into individual batches to get executed by llama.cpp (i.e. we add up to 2048 tokens and this gets split into 4 batches of 512 tokens at default settings). This splitting can improve parallelism on multi-GPU systems because the individual batches can move through the pipeline without blocking on the first one to fully complete. However, we don't yet support this in the Ollama runner, partially because it makes it hard to enforce model-specified batch constraints, which didn't exist previously. The result is that we will try to execute the full, unsplit batch, which could result in out-of-memory or insufficient KV cache space errors.

This change triggers batch breaking when the total number of inputs across all sequences exceeds the batch size, rather than per sequence. In order to ensure fairness, it also reintroduces round-robinning across sequences so that we don't let one busy sequence starve the others.
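A minimal sketch of the batching policy described above; illustrative only, not the runner's actual code:

```go
package sketch

// fillBatch pulls tokens round-robin from each sequence's pending inputs
// and stops once the *total* across sequences reaches the batch size, so
// the limit is enforced globally and no busy sequence starves the others.
func fillBatch(pending [][]int, batchSize int) (batch []int) {
	for {
		progress := false
		for i := range pending {
			if len(batch) >= batchSize {
				return batch // global limit reached: break the batch here
			}
			if len(pending[i]) > 0 {
				batch = append(batch, pending[i][0])
				pending[i] = pending[i][1:]
				progress = true
			}
		}
		if !progress {
			return batch // every sequence is drained
		}
	}
}
```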
-
Leandro Borges Ferreira authored
-
- 28 Mar, 2025 1 commit
-
-
CYJiang authored
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
-
- 27 Mar, 2025 3 commits
-
-
Jesse Gross authored
Model implementations should use Input for all of their tensors supplied to the model. This includes tensors that relate to the outputs, which is confusing since there is also an Output function. Since Output is only used internally in GGML and not used by any model implementations, we can remove it from the interface to reduce confusion.
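A rough sketch of the narrowed, model-facing interface; the names here are illustrative, not the actual API:

```go
package sketch

type Tensor interface{}

// Context is a stand-in for the model-facing interface: models mark all
// of their tensors, including output-related ones, via Input.
type Context interface {
	Input() Tensor
	// Output() was removed from this interface; it is GGML-internal.
}
```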
-
saman-amd authored
-
Parth Sareen authored
-
- 26 Mar, 2025 5 commits
-
-
molbal authored
-
Hengky Steen authored
-
Jesse Gross authored
Gemma3 uses sliding windows for its context on 5/6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size. Llama3.2-vision is also inconsistent between self-attention and cross-attention layers - at the moment, we calculate the correct total size and then average this across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size and take it into account when placing layers onto GPUs. We already do this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
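A toy contrast between the old and new estimates, with hypothetical helper names:

```go
package sketch

// conservativeEstimate is the old behavior: assume every layer needs the
// largest layer's KV size, over-estimating mixed-layer models.
func conservativeEstimate(layerKV []uint64) uint64 {
	var max uint64
	for _, s := range layerKV {
		if s > max {
			max = s
		}
	}
	return max * uint64(len(layerKV))
}

// perLayerEstimate sums the actual per-layer sizes, so sliding-window and
// full-attention layers are each placed according to their real cost.
func perLayerEstimate(layerKV []uint64) uint64 {
	var total uint64
	for _, s := range layerKV {
		total += s
	}
	return total
}
```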
-
Jesse Gross authored
-
Jesse Gross authored
When computing the size of the cache for sliding window attention, we don't need to multiply the batch size by the number of parallel sequences - the batch size is constant. This also simplifies the check for whether to allocate the cache size based on capacity or window size, as the batch size is already incorporated into the capacity when handled by the runner.
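A sketch of the resulting computation under those assumptions; the names are hypothetical:

```go
package sketch

// swaCacheSize sizes a sliding-window cache: window plus one batch of
// headroom, capped at the capacity the runner already computed (which
// has the batch size folded in).
func swaCacheSize(capacity, windowSize, batchSize int) int {
	size := windowSize + batchSize
	if size > capacity {
		return capacity
	}
	return size
}
```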
-
- 25 Mar, 2025 1 commit
-
-
copeland3300 authored
-
- 24 Mar, 2025 1 commit
-
-
Matheus C. França authored
-
- 21 Mar, 2025 12 commits
-
-
Blake Mizerany authored
Close chunked writers as soon as downloads complete, rather than deferring closure until Pull exits. This prevents exhausting file descriptors when pulling many layers. Instead of unbounded defers, use a WaitGroup and background goroutine to close each chunked writer as soon as its downloads finish. Also rename 'total' to 'received' for clarity.
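A sketch of the pattern, assuming one WaitGroup tracks a writer's outstanding chunk downloads (hypothetical structure, not the actual code):

```go
package sketch

import (
	"io"
	"sync"
)

// closeWhenDone closes a chunked writer as soon as its downloads finish,
// instead of deferring every Close until the whole pull returns.
func closeWhenDone(w io.Closer, downloads, closers *sync.WaitGroup) {
	closers.Add(1)
	go func() {
		defer closers.Done()
		downloads.Wait() // all chunks for this writer received
		w.Close()
	}()
}
```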
-
Michael Yang authored
ml/backend/ggml: load tensors in 128KiB chunks
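A minimal sketch of the loading loop; the backend's actual I/O details differ, so this is only the shape of the idea:

```go
package sketch

import "io"

// loadTensor copies tensor data in fixed 128KiB reads rather than one
// large read.
func loadTensor(dst io.Writer, src io.Reader) error {
	buf := make([]byte, 128*1024) // 128KiB per read
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```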
-
Michael Yang authored
-
Bruce MacDonald authored
-
Blake Mizerany authored
-
Parth Sareen authored
This reverts commit ffbfe833.
-
Parth Sareen authored
-
Jesse Gross authored
Currently sliding window attention allocates and uses the full context size and just masks out any tokens that are outside of the window. However, we really only need (roughly) the sliding window size. At large context sizes this improves two things:
- Memory allocated - because the full context size was previously allocated up front, memory requirements drop substantially. On Gemma3:4b with a 32k context window, total memory usage (including weights and non-sliding layers) drops from ~20GB to ~8GB.
- Computation - ranges that are completely outside of the sliding window are now removed from the tensors that are returned from the cache rather than simply being masked out. This results in more efficient processing, scaling with the size of the context that has actually been used.

Notably, this does not update the scheduler for any model to be aware of the smaller memory requirements. This is difficult for Gemma3 because the layers are heterogeneous between sliding and non-sliding attention. As a result, while actual memory consumption will be reduced, the scheduler will over-estimate the requirements of the model. This means that splitting between GPUs or GPUs and CPUs will still be suboptimal.

Bug #9730
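A small sketch of the range computation, with a hypothetical helper name:

```go
package sketch

// windowRange returns the span of cache positions still inside the
// sliding window at position pos; entries before start can be dropped
// from the returned tensors instead of masked out.
func windowRange(pos, windowSize int) (start, end int) {
	start = pos - windowSize
	if start < 0 {
		start = 0
	}
	return start, pos
}
```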
-
Jesse Gross authored
Currently the runner computes the KV size needed and creates a cache of that size. This is the context size times the number of parallel sequences. Cache implementations can make better decisions about their memory usage, so instead pass in the required capacity, number of sequences and maximum batch size. For now, the causal cache just uses this to compute the size in the same way as before.
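An illustrative shape for the new entry point; the names are not the actual API:

```go
package sketch

type Backend interface{}

// Cache implementations receive the raw inputs and choose their own size.
type Cache interface {
	Init(b Backend, capacity, numSeqs, batchSize int)
}

type causalCache struct{ size int }

// Init reproduces the old sizing for the causal cache: the capacity
// passed in already equals context size * parallel sequences.
func (c *causalCache) Init(_ Backend, capacity, numSeqs, batchSize int) {
	c.size = capacity
}
```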
-
Patrick Devine authored
-
Jesse Gross authored
This enables the runner to report progress back to the Ollama server, both for showing status to the user and also to prevent the server from killing the runner if it thinks things have stalled. Most of the infrastructure was already there; this extends it to be available to the backends.
-
Jesse Gross authored
Defragging the KV cache can generate a lot of operations, so we need to be careful that we don't overflow the number that the graph can support. We currently account for all of the nodes that we add to the graph for each move, but we also need to include the original cache tensors. Fixes #9904
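A sketch of the accounting, with hypothetical names:

```go
package sketch

// maxDefragMoves budgets cache moves against the graph's node limit,
// reserving nodes for the original cache tensors first rather than
// spending the whole limit on moves.
func maxDefragMoves(maxGraphNodes, cacheTensorNodes, nodesPerMove int) int {
	available := maxGraphNodes - cacheTensorNodes
	if available <= 0 {
		return 0
	}
	return available / nodesPerMove
}
```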
-
- 20 Mar, 2025 6 commits
-
-
Jesse Gross authored
Rather than directly giving the input data to models, we can pass a tensor instead. In the short term, this saves some duplicated code. Longer term, we will want to overlap setting up the next batch with processing of the current one. In this case, we will only have the shape of the tensor, but it will not be loaded with data at the time of graph generation. By passing only a tensor to models now, we set up this possibility and prevent them from relying on data that they won't have in the future. Although the same could be done for Positions and Outputs, in some cases we either need the raw input data or don't use them at all. Therefore, for now we leave them as they are and allow models to convert them to tensors as needed.
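A rough sketch of the contract this sets up; the interface is illustrative, not the actual one:

```go
package sketch

// Tensor may carry only a shape at graph-construction time; its data can
// be filled in later, which is what allows overlapping batch setup with
// execution of the current batch.
type Tensor interface {
	Shape() []int
}

// Model implementations build their graph from the input tensor's shape
// alone and must not assume its data is present yet.
type Model interface {
	Forward(inputs Tensor) (Tensor, error)
}
```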
-
Jesse Gross authored
Options is no longer very descriptive of this struct.
-
rylativity authored
* updates parser/parser.go to allow arbitrary roles in Modelfile MESSAGE blocks
-
Parth Sareen authored
-
Patrick Devine authored
This change allows the gemma3 template to be autodetected during `ollama create`.
-
Jesse Gross authored
Looks like a merge conflict that broke the model.
-