- 11 Oct, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 09 Oct, 2025 3 commits
-
-
Daniel Hiltgen authored
* logs: quiet down context canceled on completion
If the client closes the connection before Completion finishes, we were logging at error level, implying the runner crashed, which was misleading.
time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"
* quiet down scheduler log error on expected case
Since we don't hold the lock while performing memory load calculations, other runners can unload in parallel, so finding no runner to unload is a valid scenario which we shouldn't log at error level.
-
Jeffrey Morgan authored
This reverts commit 6a62b894.
-
Jeffrey Morgan authored
-
- 06 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
This variable isn't currently documented or intended as something the user can override, but if the user happens to set OLLAMA_LIBRARY_PATH we were doubling this in the subprocess environment which will cause problems with the new bootstrap discovery logic.
-
- 02 Oct, 2025 1 commit
-
-
Jesse Gross authored
As we automatically enable flash attention for more models, there are likely some cases where we get it wrong. This allows setting OLLAMA_FLASH_ATTENTION=0 to disable it, even for models that usually have flash attention.
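One way to read this change is as a three-state setting: unset means "use the model default", while an explicit value always wins. The sketch below assumes that interpretation; `flashAttention` and `modelDefault` are hypothetical names, and the fallback for unparsable values is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// flashAttention resolves the effective setting.
// raw and set are the result of os.LookupEnv("OLLAMA_FLASH_ATTENTION");
// modelDefault is a hypothetical per-model default.
func flashAttention(raw string, set bool, modelDefault bool) bool {
	if !set {
		return modelDefault // user expressed no preference
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return modelDefault // assumption: ignore values we cannot parse
	}
	return v // explicit 0 or 1 overrides the model default
}

func main() {
	raw, set := os.LookupEnv("OLLAMA_FLASH_ATTENTION")
	fmt.Println("flash attention enabled:", flashAttention(raw, set, true))
}
```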
-
- 01 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs. Now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available. Automatic workarounds have been removed as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code and removed support for the llama runner.
-
- 30 Sep, 2025 1 commit
-
-
Jesse Gross authored
For each memory allocation we report the size of the (attempted) allocation and whether it succeeded or failed. The latter status reporting proved to be not that useful in practice, as systems such as Windows can automatically overflow from VRAM into RAM, resulting in successful allocations even when there isn't enough memory where we wanted. As a result, this information is only used for debug logging, which isn't worthwhile enough for the amount of code. It also isn't fully accurate, as multiple allocations may result in partial failures.
-
- 12 Sep, 2025 2 commits
-
- 11 Sep, 2025 2 commits
-
-
Jesse Gross authored
If a model with a split vision projector is loaded in the Ollama engine, the projector will be ignored and the model will hallucinate a response. Instead, fall back and try to load the model in the llama engine.
-
Jesse Gross authored
New memory estimates (see #11090 for more information) are now enabled automatically for all models running on the Ollama engine, improving both stability and performance through more accurate sizing and allocation. Models running on the llama engine will continue to use the original style of memory estimation.
-
- 10 Sep, 2025 2 commits
-
-
Jesse Gross authored
If flash attention is enabled without KV cache quantization, we will currently always get this warning: level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
-
Parth Sareen authored
-
- 09 Sep, 2025 1 commit
-
-
Jesse Gross authored
The context must always be able to store the current batch, so if the user requests a small context then we should also shrink the batch to match. This also fixes the TestLongInputContext test on the new engine. (The old engine already has this behavior.)
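The rule stated here amounts to capping the batch size at the context size. A minimal sketch, with illustrative names that are not taken from the codebase:

```go
package main

import "fmt"

// effectiveBatch caps the batch size at the context size so that a
// single batch always fits in the context. Names are illustrative.
func effectiveBatch(numCtx, numBatch int) int {
	if numBatch > numCtx {
		return numCtx
	}
	return numBatch
}

func main() {
	// A small requested context shrinks the batch to match.
	fmt.Println(effectiveBatch(128, 512))
	// A large context leaves the batch unchanged.
	fmt.Println(effectiveBatch(4096, 512))
}
```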
-
- 08 Sep, 2025 1 commit
-
-
Parth Sareen authored
-
- 02 Sep, 2025 2 commits
-
-
Michael Yang authored
-
Jesse Gross authored
If a GPU's free memory is less than the reserved amount, we might get an underflow. Since the value is a uint64, we print this as a large number rather than the more correct 0. This only affects logging; the actual layout code already handles this correctly. Bug #12138
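The underflow is the usual unsigned wraparound. A saturating subtraction, sketched below with hypothetical names, yields 0 instead of a very large number:

```go
package main

import "fmt"

// availableMemory returns free minus reserved, clamped at zero.
// Hypothetical helper: a plain uint64 subtraction would wrap around
// when reserved exceeds free.
func availableMemory(free, reserved uint64) uint64 {
	if reserved > free {
		return 0
	}
	return free - reserved
}

func main() {
	var free, reserved uint64 = 100, 200

	// Naive subtraction wraps around to a value near 2^64.
	fmt.Println("naive:    ", free-reserved)
	// Clamped subtraction reports the more correct 0.
	fmt.Println("saturated:", availableMemory(free, reserved))
}
```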
-
- 29 Aug, 2025 1 commit
-
-
Daniel Hiltgen authored
* Always filter devices
Avoid crashing on unsupported AMD iGPUs
* Remove cuda device filtering
This interferes with mixed setups
-
- 26 Aug, 2025 1 commit
-
-
Michael Yang authored
-
- 20 Aug, 2025 1 commit
-
-
Jesse Gross authored
With old memory estimates, it's currently impossible to load more than one model at a time when no GPUs are available. This is because the check for whether we need to evict a model looks to see if all layers of the new model can be loaded onto GPUs, which is never true if there are no GPUs. Before the memory management changes, there was a special code path for CPU-only systems. This problem does not exist with new memory estimates. Fixes #11974
-
- 18 Aug, 2025 1 commit
-
-
Jesse Gross authored
We dump out our best memory estimate after we complete processing for any reason, including errors. This is helpful for finding what stopped us in error conditions, but in some cases we might not have gotten even the first result yet. Fixes #11957
-
- 14 Aug, 2025 1 commit
-
-
Jesse Gross authored
This changes the memory allocation strategy from upfront estimation to tracking actual allocations done by the engine and reacting to that. The goal is to avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs). It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other cases is unchanged and will continue to use the existing estimates.
-
- 23 Jun, 2025 2 commits
-
-
Daniel Hiltgen authored
For smaller context models, make sure we do not exceed the training size.
-
Daniel Hiltgen authored
* Re-remove cuda v11
Revert the revert - drop v11 support, requiring drivers newer than Feb 23
This reverts commit c6bcdc42.
* Simplify layout
With only one version of the GPU libraries, we can simplify things down somewhat. (Jetsons still require special handling)
* distinct sbsa variant for linux arm64
This avoids accidentally trying to load the sbsa cuda libraries on a jetson system, which results in crashes.
* temporarily prevent rocm+cuda mixed loading
-
- 29 May, 2025 1 commit
-
-
Jesse Gross authored
"POST predict" basically means that the runner has crashed, which can have many reasons. However, many people think this is a specific error and either report only this message or group together unrelated bugs. This replaces it with a more friendly and helpful message.
-
- 19 May, 2025 1 commit
-
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them to be two steps:
- Create backend, including enumerating tensors and memory allocation
- Loading tensor data
This allows more flexibility in managing model loading.
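The two-step split can be pictured roughly as follows. This is a sketch under stated assumptions, not the engine's actual API: `tensorInfo`, `backend`, `newBackend`, and `load` are all hypothetical names.

```go
package main

import "fmt"

// tensorInfo is a hypothetical description of one tensor in a model file.
type tensorInfo struct {
	name string
	size int
}

// backend is a hypothetical stand-in for an engine backend.
type backend struct {
	tensors map[string][]byte // allocated buffers, keyed by tensor name
	loaded  bool
}

// newBackend performs step one: enumerate tensors and allocate memory.
// No tensor data is read here.
func newBackend(infos []tensorInfo) *backend {
	b := &backend{tensors: make(map[string][]byte, len(infos))}
	for _, ti := range infos {
		b.tensors[ti.name] = make([]byte, ti.size)
	}
	return b
}

// load performs step two: fill the already allocated buffers.
// read is a hypothetical callback that copies a tensor's data into dst.
func (b *backend) load(read func(name string, dst []byte) error) error {
	for name, buf := range b.tensors {
		if err := read(name, buf); err != nil {
			return fmt.Errorf("loading %s: %w", name, err)
		}
	}
	b.loaded = true
	return nil
}

func main() {
	b := newBackend([]tensorInfo{{"blk.0.attn_q.weight", 8}, {"output.weight", 4}})
	fmt.Println("allocated:", len(b.tensors), "loaded:", b.loaded)

	err := b.load(func(name string, dst []byte) error {
		for i := range dst {
			dst[i] = 1 // placeholder for real tensor data
		}
		return nil
	})
	fmt.Println("loaded:", b.loaded, "err:", err)
}
```

Because allocation happens before any data is read, a caller can inspect or adjust the memory layout between the two steps.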
-
- 14 May, 2025 1 commit
-
-
Michael Yang authored
-
- 13 May, 2025 1 commit
-
-
Daniel Hiltgen authored
Bring back v11 until we can better warn users that their driver is too old. This reverts commit fa393554.
-
- 12 May, 2025 1 commit
-
-
Michael Yang authored
reduce prompt log to trace level
-
- 07 May, 2025 2 commits
-
-
Daniel Hiltgen authored
If a model is loading, and the request context is canceled during the load by a client closing the connection, and another request is inbound for the same model with a different configuration (context size, etc.) thus requiring a reload, two unload events can be in flight. The first shuts down the original model load, but the second one caused the loss of the new reloading runner reference, thus triggering the leak. The primary fix is detecting the duplicate unload and ignoring the second instance. The load routine is also hardened to ensure we detect clobbering an already present runner and unload it with a warning.
-
Daniel Hiltgen authored
This reduces the size of our Windows installer payloads by ~256M by dropping support for nvidia drivers older than Feb 2023. Hardware support is unchanged. Linux default bundle sizes are reduced by ~600M to 1G.
-
- 05 May, 2025 1 commit
-
-
Jeffrey Morgan authored
Some options listed in api/types.go are not supported in newer models, or have been deprecated in the past. This is the first of a series of PRs to clean up the API options.
-
- 03 May, 2025 2 commits
-
-
Daniel Hiltgen authored
For all search path env vars, make sure our dirs are first to avoid potentially finding other incompatible libraries on the user's system. Also fixes a minor build script glitch for windows rocm
-
Daniel Hiltgen authored
This enhances our logging in the scheduler. The initial "waiting for server" log no longer claims an initial error state (now "not responding" which better reflects the actual state). Runners now have slog wiring to report more details about the runner, including PID.
-
- 30 Apr, 2025 1 commit
-
-
Daniel Hiltgen authored
Users may have other incompatible GGML installs on their systems. This will prevent us from trying to load them from the path.
-
- 24 Apr, 2025 1 commit
-
-
Parth Sareen authored
-
- 03 Apr, 2025 1 commit
-
-
Bruce MacDonald authored
No functional change. Many different done reasons can be set at the runner level, so rather than obscuring them we should return them to the server process and let it choose what to do with the done reason. This separates the API concerns from the runner.
-
- 26 Mar, 2025 1 commit
-
-
Jesse Gross authored
Gemma3 uses sliding windows for its context on 5/6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size. Llama3.2-vision is also inconsistent between self attention and cross attention layers - at the moment, we calculate the correct total size and then average this across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized by the average. This allows memory estimation to calculate per-layer KV cache size and take this into account when placing layers onto GPUs. We already do this for weights that vary per-tensor, so this is a logical extension. Fixes #9730 Fixes #9890
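To see why averaging can mislead placement, consider a rough sketch of per-layer KV cache sizing. The layer pattern (every sixth layer attending over the full context), the window size, and the bytes-per-token figure are all assumptions chosen for illustration; `kvBytesPerLayer` is a hypothetical function.

```go
package main

import "fmt"

// kvBytesPerLayer returns a hypothetical per-layer KV cache size in bytes.
// Layers where (i+1)%globalEvery == 0 attend over the full context; the
// rest use a sliding window. bytesPerToken is an assumed constant.
func kvBytesPerLayer(layers, ctx, window, globalEvery int, bytesPerToken uint64) []uint64 {
	sizes := make([]uint64, layers)
	for i := range sizes {
		tokens := ctx
		if (i+1)%globalEvery != 0 && window < ctx {
			tokens = window
		}
		sizes[i] = uint64(tokens) * bytesPerToken
	}
	return sizes
}

func main() {
	sizes := kvBytesPerLayer(6, 8192, 1024, 6, 4096)

	var total uint64
	for _, s := range sizes {
		total += s
	}
	fmt.Println("per layer:", sizes)
	fmt.Println("average:  ", total/uint64(len(sizes)))
}
```

In this example the full-context layer needs several times the average, so a GPU sized by the average could not hold it, while sizing every layer at the maximum would over-reserve for the other five.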
-
- 14 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
This commit refactors the LLM subsystem by removing internal subprocess request and response types. It consolidates duplicate type definitions across the codebase, moving them to centralized locations. The change also standardizes interfaces between components, simplifies the ServerStatusResp struct, and moves the ParseDurationMs function to a common package. This cleanup reduces code duplication between different runner implementations (llamarunner and ollamarunner).
-