- 30 Oct, 2025 2 commits
-
-
Jesse Gross authored
When a model is partially offloaded to system RAM, we can either do the calculations on the CPU or we can temporarily transfer the data to the GPU to do the calculations there. Small batches tend to be better on the CPU, large batches on the GPU. The llamarunner used the GPU in most cases and the ollamarunner used the CPU. Although the ollamarunner saw an improvement in token generation performance, there was a large performance hit in prompt processing (3-10x). There is an existing heuristic to dynamically switch between these two modes but in practice it doesn't have enough information to accurately make that decision. This adds authoritative data to make the check work to get the best of both worlds. Fixes #12037
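The trade-off described above can be sketched as a simple batch-size check. This is a minimal illustration only: `chooseDevice` and the `minGPUBatch` threshold are hypothetical names and values, not the actual heuristic or the "authoritative data" this commit adds.

```go
package main

import "fmt"

// Device identifies where an offloaded layer's computation runs.
type Device int

const (
	CPU Device = iota
	GPU
)

func (d Device) String() string {
	if d == GPU {
		return "GPU"
	}
	return "CPU"
}

// chooseDevice picks where to compute a layer whose weights live in system RAM.
// minGPUBatch is an assumed threshold for illustration, not a value from Ollama.
func chooseDevice(batchSize, minGPUBatch int) Device {
	if batchSize >= minGPUBatch {
		// Large batch: the cost of copying weights to the GPU is amortized.
		return GPU
	}
	// Small batch: the transfer would dominate, so stay on the CPU.
	return CPU
}

func main() {
	const minGPUBatch = 32 // assumed cutoff
	fmt.Println(chooseDevice(1, minGPUBatch))   // token generation, single-token batch
	fmt.Println(chooseDevice(512, minGPUBatch)) // prompt processing, large batch
}
```

In this sketch the caller must know the real batch size at decision time, which is the kind of information the commit message says the existing heuristic lacked.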
-
Jesse Gross authored
We currently allocate the worst case batch for max sized batches, which corresponds to prompt processing. However, there are some cases where the generated graph is different for small and large batches. To ensure that we don't need to allocate memory later after layout has taken place, we should run the worst case batch both ways and take the larger amount of memory. This does not noticeably affect loading speed as the most expensive part of this logic is from image processing and that does not occur during token generation.
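A minimal sketch of "run the worst case both ways and take the larger amount", assuming a hypothetical `graphSize` function that stands in for building a graph and measuring its memory. The sizes are invented for illustration.

```go
package main

import "fmt"

// graphSize is a hypothetical stand-in for building the compute graph for a
// given batch size and asking the backend how much memory it needs.
// The numbers are invented; only the shape of the logic matters.
func graphSize(batchSize int) uint64 {
	if batchSize == 1 {
		// Assume the small-batch graph takes a different path that needs more memory.
		return 96 << 20
	}
	return 64 << 20
}

// worstCaseReservation measures both the max-sized batch and a single-token
// batch and reserves whichever is larger, so no allocation is needed later.
func worstCaseReservation(maxBatch int) uint64 {
	large := graphSize(maxBatch)
	small := graphSize(1)
	if small > large {
		return small
	}
	return large
}

func main() {
	fmt.Println(worstCaseReservation(512))
}
```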
-
- 29 Oct, 2025 1 commit
-
-
Michael Yang authored
-
- 28 Oct, 2025 2 commits
-
-
Patrick Devine authored
This reverts commit 5d347f6d.
-
Michael Yang authored
-
- 27 Oct, 2025 1 commit
-
-
nicole pardal authored
Currently, checking the length of prompts for embeddings to ensure they fit in the context window (and possible truncation) occurs in two places - the Ollama server and runner. This can lead to inconsistencies in both the checks and reported number of tokens processed. Since we have to do this processing in the runner, this consolidates all of the logic there.
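As a rough sketch of doing the length check and truncation in one place, assume a hypothetical `fitToContext` helper. The name, signature, and error are illustrative and not taken from the runner's code.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooLong is a hypothetical error for inputs that do not fit the context.
var errTooLong = errors.New("input exceeds context length")

// fitToContext returns the tokens that will actually be processed.
// If the input is longer than numCtx it is either truncated or rejected.
// Because the check happens in one place, len(result) is also the single
// source for the reported token count.
func fitToContext(tokens []int32, numCtx int, truncate bool) ([]int32, error) {
	if len(tokens) <= numCtx {
		return tokens, nil
	}
	if !truncate {
		return nil, errTooLong
	}
	return tokens[:numCtx], nil
}

func main() {
	tokens := []int32{10, 11, 12, 13, 14}
	out, err := fitToContext(tokens, 3, true)
	fmt.Println(len(out), err)
}
```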
-
- 20 Oct, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 11 Oct, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 10 Oct, 2025 1 commit
-
-
Michael Yang authored
hardErrCh will deadlock since forwardBatch is blocked on computeStartedCh, which never gets sent. Since the response to hardErrCh is to panic, just panic instead.
-
- 09 Oct, 2025 3 commits
-
-
Michael Yang authored
-
Jeffrey Morgan authored
This reverts commit 6a62b894.
-
Jeffrey Morgan authored
-
- 01 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs. Now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available. Automatic workarounds have been removed as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code and removed support for the llama runner.
-
- 17 Sep, 2025 1 commit
-
-
russcoss authored
Signed-off-by: russcoss <russcoss@outlook.com>
-
- 16 Sep, 2025 1 commit
-
-
Michael Yang authored
* cleanup
* use pooling.TypeNone
* pooling test
-
- 15 Sep, 2025 1 commit
-
-
Michael Yang authored
this cleans up the model interface slightly without too much impact in other areas
-
- 12 Sep, 2025 2 commits
-
- 11 Sep, 2025 1 commit
-
-
Jesse Gross authored
Allocation failures can be a normal part of new memory estimates, so we shouldn't print a stack trace in this case.
-
- 10 Sep, 2025 1 commit
-
-
Parth Sareen authored
-
- 09 Sep, 2025 1 commit
-
-
Jesse Gross authored
The context must always be able to store the current batch, so if the user requests a small context then we should also shrink the batch to match. This also fixes the TestLongInputContext test on the new engine. (The old engine already has this behavior.)
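The rule stated above amounts to clamping the batch size to the context length. A minimal sketch, with the hypothetical name `effectiveBatchSize`:

```go
package main

import "fmt"

// effectiveBatchSize shrinks the batch so that it always fits in the context.
// The function name is hypothetical; the rule comes from the commit message.
func effectiveBatchSize(requestedBatch, numCtx int) int {
	if requestedBatch > numCtx {
		return numCtx
	}
	return requestedBatch
}

func main() {
	fmt.Println(effectiveBatchSize(512, 128))  // small context: batch is reduced
	fmt.Println(effectiveBatchSize(512, 4096)) // large context: batch is unchanged
}
```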
-
- 08 Sep, 2025 2 commits
-
-
Parth Sareen authored
-
Michael Yang authored
-
- 04 Sep, 2025 2 commits
-
-
Michael Yang authored
* ollama: add embeddings
-
Michael Yang authored
-
- 29 Aug, 2025 1 commit
-
-
Daniel Hiltgen authored
* perf: build graph for next batch in parallel to keep GPU busy. This refactors the main run loop of the ollama runner to perform the main GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next batch in parallel, reducing the amount of time the GPU stalls waiting for the next batch of work.
* tests: tune integration tests for ollama engine. This tunes the integration tests to focus more on models supported by the new engine.
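The overlap between preparing one batch and computing another can be sketched with two goroutines and a one-slot channel. Everything here (`prepare`, `compute`, `runPipeline`) is a hypothetical stand-in, not the runner's actual loop.

```go
package main

import "fmt"

// batch is a hypothetical unit of work.
type batch struct{ id int }

// prepare stands in for building the graph for a batch (CPU-side work).
func prepare(id int) batch { return batch{id: id} }

// compute stands in for the GPU-intensive step.
func compute(b batch) int { return b.id * 2 }

// runPipeline prepares batch i+1 while batch i is being computed.
func runPipeline(n int) []int {
	prepared := make(chan batch, 1) // one slot: prepare at most one batch ahead
	results := make(chan int)

	// Compute stage: runs in its own goroutine so preparation is not blocked.
	go func() {
		defer close(results)
		for b := range prepared {
			results <- compute(b)
		}
	}()

	// Preparation stage: feeds the compute stage.
	go func() {
		defer close(prepared)
		for i := 0; i < n; i++ {
			prepared <- prepare(i)
		}
	}()

	var out []int
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(runPipeline(3))
}
```

A single compute goroutine keeps results in submission order, which matters when each batch depends on the tokens produced by the previous one.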
-
- 22 Aug, 2025 1 commit
-
-
zoupingshi authored
Signed-off-by: zoupingshi <hangfachang@outlook.com>
-
- 14 Aug, 2025 1 commit
-
-
Jesse Gross authored
This changes the memory allocation strategy from upfront estimation to tracking actual allocations done by the engine and reacting to that. The goal is to avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs). It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other cases is unchanged and will continue to use the existing estimates.
-
- 08 Aug, 2025 1 commit
-
-
Jesse Gross authored
In order to iteratively find the best memory allocation, we need to be able to free backend memory so we can try again.
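A rough sketch of why freeing matters for an iterative search: each failed attempt must be released before the next one is tried. The `backend` type and the linear search below are assumptions made for illustration, not the strategy the engine actually uses.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoMemory = errors.New("allocation failed")

// backend is a hypothetical device with a fixed memory capacity.
type backend struct {
	capacity  uint64
	allocated uint64
}

// alloc tries to reserve memory for the given number of layers.
func (b *backend) alloc(layers int, perLayer uint64) error {
	need := uint64(layers) * perLayer
	if need > b.capacity {
		return errNoMemory
	}
	b.allocated = need
	return nil
}

// free releases whatever the last attempt reserved so we can try again.
func (b *backend) free() { b.allocated = 0 }

// fitLayers tries progressively fewer layers until an allocation succeeds.
// A linear search is used only to keep the example short.
func fitLayers(b *backend, totalLayers int, perLayer uint64) int {
	for n := totalLayers; n > 0; n-- {
		if err := b.alloc(n, perLayer); err == nil {
			return n
		}
		b.free()
	}
	return 0
}

func main() {
	b := &backend{capacity: 10}
	fmt.Println(fitLayers(b, 8, 3))
}
```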
-
- 22 May, 2025 2 commits
-
-
Jesse Gross authored
FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics. Empty and Zeros directly panic if they can't allocate memory. This makes things consistent by panicking for the first two cases, removing a fair amount of error handling code. This is also consistent with how Go typically handles these situations.
-
Jesse Gross authored
This provides granular information about the backend memory allocations required by the runner:
- Per backend
- Per layer
- Weights, cache and graph
- Allocation status
This can be used for debugging and validating memory estimates.
-
- 19 May, 2025 1 commit
-
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps:
- Create backend, including enumerating tensors and memory allocation
- Loading tensor data
This allows more flexibility in managing model loading.
-
- 15 May, 2025 3 commits
-
-
Jesse Gross authored
We currently preallocate compute graph memory for the worst case batch of text tokens. This adds support for doing the same for images. Note that image models are more complicated than text models in how they process their inputs so there may be cases where this approach isn't completely generic for all models. It covers all currently supported models though.
-
Jesse Gross authored
For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner. However, this doesn't work if we need to use the embedding in multiple batches. This can arise if the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph. This codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about this. It also gives the runner visibility into the multimodal tensors, which is important for memory management.
-
Jesse Gross authored
When we restore a sequence from the cache, we split the prompt into the already used tokens (stored in the cache) and new tokens that need to be processed. Currently, the references to the used tokens are coming from the stored previous sequence. However, even though we know that the used tokens are semantically equivalent to the prefix of the prompt, tokens can contain pointers which are no longer valid. As a result, it is better to get the used tokens from the prompt, which has currently valid pointers. This doesn't currently have any impact because it isn't possible to reuse the pointers (which are tensors) anyways. However, it becomes an issue once we can.
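The idea of taking the used tokens from the prompt rather than from the cached sequence can be sketched as follows. The `input` type, its `ref` field, and both functions are hypothetical; `ref` stands in for any pointer-like field (such as a tensor) that may be stale in the cached copy.

```go
package main

import "fmt"

// input is a hypothetical token type with a pointer-like field.
type input struct {
	token int32
	ref   *string
}

// commonPrefix counts the leading inputs whose token values match.
func commonPrefix(cached, prompt []input) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n].token == prompt[n].token {
		n++
	}
	return n
}

// splitPrompt returns the already-cached prefix and the remaining new inputs.
// Both slices come from prompt, so their references belong to the current
// request rather than to the previously stored sequence.
func splitPrompt(cached, prompt []input) (used, fresh []input) {
	n := commonPrefix(cached, prompt)
	return prompt[:n], prompt[n:]
}

func main() {
	stale, live := "stale", "live"
	cached := []input{{1, &stale}, {2, &stale}}
	prompt := []input{{1, &live}, {2, &live}, {3, &live}}

	used, fresh := splitPrompt(cached, prompt)
	fmt.Println(len(used), len(fresh), *used[0].ref)
}
```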
-
- 12 May, 2025 1 commit
-
-
Michael Yang authored
reduce prompt log to trace level
-
- 08 May, 2025 1 commit
-
-
Jesse Gross authored
The correct constant to remove all entries to the end of the sequence for the Ollama engine is math.MaxInt32. -1 is used by the old engine. The impact of this is currently minimal because it would only occur in situations that are not supported by the implemented models or rarely used options.
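A small sketch of why the sentinel matters, assuming a hypothetical `remove` function that deletes positions in the half-open range `[begin, end)`. With this convention, `math.MaxInt32` means "through the end of the sequence", while `-1` matches nothing.

```go
package main

import (
	"fmt"
	"math"
)

// toEnd marks "remove through the end of the sequence" in this sketch.
const toEnd int32 = math.MaxInt32

// remove drops cache positions in [begin, end) and returns the rest.
// It is a hypothetical stand-in for a cache removal operation.
func remove(positions []int32, begin, end int32) []int32 {
	kept := make([]int32, 0, len(positions))
	for _, p := range positions {
		if p < begin || p >= end {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	positions := []int32{0, 1, 2, 3, 4}
	fmt.Println(remove(positions, 2, toEnd)) // removes positions 2 and up
	fmt.Println(remove(positions, 2, -1))    // wrong sentinel: nothing is removed
}
```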
-
- 05 May, 2025 1 commit
-
-
Jeffrey Morgan authored
Some options listed in api/types.go are not supported in newer models, or have been deprecated in the past. This is the first of a series of PRs to clean up the API options.
-
- 02 May, 2025 1 commit
-
-
Jesse Gross authored
Worst case graph preallocation was disabled by a27462b7 "ollamarunner: Temporarily disable worst case graph preallocation" since it caused crashes with large batches when not using the GPU. This backports upstream llama.cpp commit f057808 "ggml: Don't assert fail when tensor data changes (#13222)", which fixes the underlying bug and allows reverting the previous workaround.
-
- 01 May, 2025 1 commit
-
-
Jesse Gross authored
The context (and therefore associated input tensors) was not being properly closed when images were being processed. We were trying to close them but in reality we were closing over an empty list, preventing anything from actually being freed. Fixes #10434
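The commit message does not show the exact code, but one common way this kind of bug arises in Go is that a deferred call's arguments are evaluated at the `defer` statement, while the list is still empty. The sketch below uses hypothetical names (`ctx`, `closeAll`, `leaky`, `fixed`) to contrast that with deferring a closure.

```go
package main

import "fmt"

// ctx is a hypothetical resource that must be closed.
type ctx struct{ closed bool }

func closeAll(ctxs []*ctx) {
	for _, c := range ctxs {
		c.closed = true
	}
}

// leaky passes the slice to closeAll at the defer statement, when it is
// still empty, so the contexts appended later are never closed.
func leaky() []*ctx {
	var ctxs []*ctx
	defer closeAll(ctxs)
	ctxs = append(ctxs, &ctx{}, &ctx{})
	return ctxs
}

// fixed defers a closure, so the slice is read when the function returns.
func fixed() []*ctx {
	var ctxs []*ctx
	defer func() { closeAll(ctxs) }()
	ctxs = append(ctxs, &ctx{}, &ctx{})
	return ctxs
}

func main() {
	fmt.Println(leaky()[0].closed, fixed()[0].closed)
}
```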
-