- 01 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs. Now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available. Automatic workarounds have been removed, as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code and removed support for the llama runner.
-
- 30 Sep, 2025 1 commit
-
-
Jesse Gross authored
For each memory allocation we report the size of the (attempted) allocation and whether it succeeded or failed. The latter status reporting proved not that useful in practice, as systems such as Windows can automatically overflow from VRAM into RAM, resulting in successful allocations even when there isn't enough memory where we wanted it. As a result, this information is only used for debug logging, which isn't worthwhile enough for the amount of code. It also isn't fully accurate, as multiple allocations may result in partial failures.
-
- 12 Sep, 2025 2 commits
-
- 11 Sep, 2025 2 commits
-
-
Jesse Gross authored
If a model with a split vision projector is loaded in the Ollama engine, the projector will be ignored and the model will hallucinate a response. Instead, fall back and try to load the model in the llama engine.
-
Jesse Gross authored
New memory estimates (see #11090 for more information) are now enabled automatically for all models running on the Ollama engine, improving both stability and performance through more accurate sizing and allocation. Models running on the llama engine will continue to use the original style of memory estimation.
-
- 10 Sep, 2025 2 commits
-
-
Jesse Gross authored
If flash attention is enabled without KV cache quantization, we currently always get this warning: level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
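A minimal sketch in Go of the guard this change implies (the function name and parameters are illustrative, not the actual server.go code): warn only when a KV cache type was actually requested.

```go
package main

import "log/slog"

// resolveKVCacheType only warns about an unsupported KV cache type when one
// was actually requested, so enabling flash attention without quantization no
// longer logs a spurious warning. (Illustrative sketch, not the real code.)
func resolveKVCacheType(requested string, supportedByModel bool) string {
	if requested == "" {
		return "" // flash attention alone: nothing requested, nothing to warn about
	}
	if !supportedByModel {
		slog.Warn("kv cache type not supported by model", "type", requested)
		return ""
	}
	return requested
}

func main() {
	resolveKVCacheType("", false)     // silent after the fix
	resolveKVCacheType("q8_0", false) // still warns, as intended
}
```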
-
Parth Sareen authored
-
- 09 Sep, 2025 1 commit
-
-
Jesse Gross authored
The context must always be able to store the current batch, so if the user requests a small context then we should also shrink the batch to match. This also fixes the TestLongInputContext test on the new engine. (The old engine already has this behavior.)
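A toy sketch of the constraint, using hypothetical numCtx/numBatch parameters rather than the real runner options:

```go
package main

import "fmt"

// effectiveBatchSize enforces that the batch can never exceed the context, so
// a small user-requested context also shrinks the batch size.
func effectiveBatchSize(numCtx, numBatch int) int {
	if numBatch > numCtx {
		return numCtx
	}
	return numBatch
}

func main() {
	fmt.Println(effectiveBatchSize(8, 512))    // tiny context: batch shrinks to 8
	fmt.Println(effectiveBatchSize(4096, 512)) // normal case: batch stays 512
}
```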
-
- 08 Sep, 2025 1 commit
-
-
Parth Sareen authored
-
- 02 Sep, 2025 2 commits
-
-
Michael Yang authored
-
Jesse Gross authored
If a GPU's free memory is less than the reserved amount, we might get an underflow. Since the value is an unsigned 64-bit integer, it wraps around and we print a huge number rather than the more correct 0. This only affects logging; the actual layout code already handles this correctly. Bug #12138
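As a sketch, the logging-side fix amounts to a saturating subtraction (names here are illustrative, not the actual discovery code):

```go
package main

import "fmt"

// availableVRAM clamps the subtraction so a GPU whose free memory is below
// the reserved amount reports 0 instead of a wrapped-around uint64.
func availableVRAM(free, reserved uint64) uint64 {
	if free < reserved {
		return 0
	}
	return free - reserved
}

func main() {
	fmt.Println(availableVRAM(512<<20, 1<<30)) // 0, not an enormous wrapped value
	fmt.Println(availableVRAM(8<<30, 1<<30))   // 7 GiB remaining
}
```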
-
- 29 Aug, 2025 1 commit
-
-
Daniel Hiltgen authored
* Always filter devices: avoid crashing on unsupported AMD iGPUs
* Remove CUDA device filtering: this interferes with mixed setups
-
- 26 Aug, 2025 1 commit
-
-
Michael Yang authored
-
- 20 Aug, 2025 1 commit
-
-
Jesse Gross authored
With old memory estimates, it's currently impossible to load more than one model at a time when no GPUs are available. This is because the check for whether we need to evict a model looks to see if all layers of the new model can be loaded onto GPUs, which is never true if there are no GPUs. Before the memory management changes, there was a special code path for CPU-only systems. This problem does not exist with new memory estimates. Fixes #11974
-
- 18 Aug, 2025 1 commit
-
-
Jesse Gross authored
We dump out our best memory estimate after we complete processing for any reason, including errors. This is helpful for finding what stopped us in error conditions, but in some cases we might not have gotten even the first result yet. Fixes #11957
-
- 14 Aug, 2025 1 commit
-
-
Jesse Gross authored
This changes the memory allocation strategy from upfront estimation to tracking the actual allocations done by the engine and reacting to that. The goal is to avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs). It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other cases is unchanged and will continue to use the existing estimates.
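A rough sketch of how such an opt-in is typically read (the real flag handling in Ollama may parse values differently; the function below is invented for illustration):

```go
package main

import (
	"fmt"
	"os"
)

// newEstimatesEnabled treats any non-empty OLLAMA_NEW_ESTIMATES value as
// opting in, and only applies to models running on the Ollama engine.
func newEstimatesEnabled(ollamaEngine bool) bool {
	return ollamaEngine && os.Getenv("OLLAMA_NEW_ESTIMATES") != ""
}

func main() {
	os.Setenv("OLLAMA_NEW_ESTIMATES", "1")
	fmt.Println(newEstimatesEnabled(true))  // true: opted in on the Ollama engine
	fmt.Println(newEstimatesEnabled(false)) // false: llama engine keeps old estimates
}
```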
-
- 23 Jun, 2025 2 commits
-
-
Daniel Hiltgen authored
For smaller context models, make sure we do not exceed the training size.
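A minimal sketch of the clamp, with hypothetical parameter names:

```go
package main

import "fmt"

// clampContext never hands the runner a context larger than the model's
// training context size (when that size is known).
func clampContext(requested, trainCtx int) int {
	if trainCtx > 0 && requested > trainCtx {
		return trainCtx
	}
	return requested
}

func main() {
	fmt.Println(clampContext(8192, 2048)) // small-context model: clamp to 2048
	fmt.Println(clampContext(2048, 8192)) // request within training size: keep 2048
}
```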
-
Daniel Hiltgen authored
* Re-remove CUDA v11: revert the revert and drop v11 support, requiring drivers newer than Feb 2023. This reverts commit c6bcdc42.
* Simplify layout: with only one version of the GPU libraries, we can simplify things down somewhat. (Jetsons still require special handling.)
* Distinct sbsa variant for linux arm64: this avoids accidentally trying to load the sbsa CUDA libraries on a Jetson system, which results in crashes.
* Temporarily prevent rocm+cuda mixed loading
-
- 29 May, 2025 1 commit
-
-
Jesse Gross authored
"POST predict" basically means that the runner has crashed, which can have many reasons. However, many people think this is a specific error and either report only this message or group together unrelated bugs. This replaces it with a more friendly and helpful message.
-
- 19 May, 2025 1 commit
-
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps:
- Create the backend, including enumerating tensors and allocating memory
- Load the tensor data
This allows more flexibility in managing model loading.
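A rough sketch of the two-step split, using an invented Backend interface rather than Ollama's real types:

```go
package main

import "fmt"

// Backend is a stand-in for the model backend: creation enumerates tensors
// and allocates memory, while tensor data is loaded as a separate later step.
type Backend interface {
	Load() error // read tensor data into the memory allocated at creation
}

type fakeBackend struct{ tensors int }

func (b *fakeBackend) Load() error {
	fmt.Printf("loading data for %d tensors\n", b.tensors)
	return nil
}

// newBackend is fast: it only enumerates tensors and reserves memory.
func newBackend(tensorCount int) Backend {
	return &fakeBackend{tensors: tensorCount}
}

func main() {
	b := newBackend(291) // step 1: create backend, allocate (count is made up)
	_ = b.Load()         // step 2: load tensor data when needed
}
```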
-
- 14 May, 2025 1 commit
-
-
Michael Yang authored
-
- 13 May, 2025 1 commit
-
-
Daniel Hiltgen authored
Bring back v11 until we can better warn users that their driver is too old. This reverts commit fa393554.
-
- 12 May, 2025 1 commit
-
-
Michael Yang authored
reduce prompt log to trace level
-
- 07 May, 2025 2 commits
-
-
Daniel Hiltgen authored
If a model is loading and the request context is canceled during the load by a client closing the connection, while another request is inbound for the same model with a different configuration (context size, etc.) that requires a reload, two unload events can be in flight. The first shuts down the original model load, but the second caused the loss of the reference to the new reloading runner, triggering the leak. The primary fix is detecting the duplicate unload and ignoring the second instance. The load routine is also hardened to detect clobbering an already present runner and unload it with a warning.
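A heavily simplified sketch of the two guards described above, with invented scheduler and runner types; the real scheduler is considerably more involved:

```go
package main

import "log/slog"

type runner struct{ id int }

type scheduler struct{ loaded map[string]*runner }

// unload ignores requests for a runner that is no longer the one registered
// for the model, preventing a stale unload from dropping the new runner.
func (s *scheduler) unload(model string, r *runner) {
	if s.loaded[model] != r {
		slog.Debug("ignoring duplicate unload for stale runner", "model", model)
		return
	}
	delete(s.loaded, model)
}

// load warns and unloads if it would clobber an already present runner.
func (s *scheduler) load(model string, r *runner) {
	if prev, ok := s.loaded[model]; ok && prev != r {
		slog.Warn("clobbering existing runner, unloading it", "model", model)
		s.unload(model, prev)
	}
	s.loaded[model] = r
}

func main() {
	s := &scheduler{loaded: map[string]*runner{}}
	old, reloaded := &runner{1}, &runner{2}
	s.load("example-model", old)
	s.load("example-model", reloaded) // reload with a new configuration
	s.unload("example-model", old)    // stale unload: ignored, reloaded runner survives
}
```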
-
Daniel Hiltgen authored
This reduces the size of our Windows installer payloads by ~256M by dropping support for nvidia drivers older than Feb 2023. Hardware support is unchanged. Linux default bundle sizes are reduced by ~600M to 1G.
-
- 05 May, 2025 1 commit
-
-
Jeffrey Morgan authored
Some options listed in api/types.go are not supported in newer models or have been deprecated in the past. This is the first of a series of PRs to clean up the API options.
-
- 03 May, 2025 2 commits
-
-
Daniel Hiltgen authored
For all search-path env vars, make sure our dirs come first to avoid potentially finding other incompatible libraries on the user's system. Also fixes a minor build script glitch for Windows ROCm.
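A simplified sketch of the search-path idea (the directory and variable names are examples, not the actual build or runtime configuration):

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// prependToPathVar puts our own library directory at the front of a
// search-path variable so incompatible copies elsewhere are found after ours.
func prependToPathVar(key, dir string) {
	existing := os.Getenv(key)
	if existing == "" {
		os.Setenv(key, dir)
		return
	}
	// Skip if our dir is already the first entry.
	parts := strings.Split(existing, string(os.PathListSeparator))
	if parts[0] == dir {
		return
	}
	os.Setenv(key, dir+string(os.PathListSeparator)+existing)
}

func main() {
	libDir := filepath.Join("/opt", "ollama", "lib") // hypothetical install dir
	prependToPathVar("LD_LIBRARY_PATH", libDir)
}
```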
-
Daniel Hiltgen authored
This enhances our logging in the scheduler. The initial "waiting for server" log no longer claims an initial error state (now "not responding" which better reflects the actual state). Runners now have slog wiring to report more details about the runner, including PID.
-
- 30 Apr, 2025 1 commit
-
-
Daniel Hiltgen authored
Users may have other incompatible GGML installs on their systems. This will prevent us from trying to load them from the path.
-
- 24 Apr, 2025 1 commit
-
-
Parth Sareen authored
-
- 03 Apr, 2025 1 commit
-
-
Bruce MacDonald authored
No functional change. Many different done reasons can be set at the runner level, so rather than obscuring them we should return them to the server process and let it choose what to do with the done reason. This separates the API concerns from the runner.
-
- 26 Mar, 2025 1 commit
-
-
Jesse Gross authored
Gemma3 uses sliding windows for its context on 5/6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size. Llama3.2-vision is also inconsistent between self-attention and cross-attention layers - at the moment, we calculate the correct total size and then average this across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized by the average. This allows memory estimation to calculate per-layer KV cache size and take this into account when placing layers onto GPUs. We already do this for weights that vary per tensor, so this is a logical extension. Fixes #9730 Fixes #9890
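A toy illustration of why per-layer sizing matters, with made-up layer sizes in the spirit of a sliding-window model like Gemma3 (1 in 6 layers full attention):

```go
package main

import "fmt"

func main() {
	// Hypothetical per-layer KV cache sizes: most layers use a small sliding
	// window, every sixth layer uses full attention.
	perLayer := []uint64{}
	for i := 0; i < 30; i++ {
		if i%6 == 5 {
			perLayer = append(perLayer, 512<<20) // full-attention layer
		} else {
			perLayer = append(perLayer, 32<<20) // sliding-window layer
		}
	}

	var maxLayer, total uint64
	for _, sz := range perLayer {
		total += sz
		if sz > maxLayer {
			maxLayer = sz
		}
	}

	// The old approach assumed every layer is as large as the biggest one.
	oldEstimate := maxLayer * uint64(len(perLayer))
	fmt.Printf("per-layer total: %d MiB, old uniform estimate: %d MiB\n",
		total>>20, oldEstimate>>20)
}
```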
-
- 14 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
This commit refactors the LLM subsystem by removing internal subprocess request and response types. It consolidates duplicate type definitions across the codebase, moving them to centralized locations. The change also standardizes interfaces between components, simplifies the ServerStatusResp struct, and moves the ParseDurationMs function to a common package. This cleanup reduces code duplication between different runner implementations (llamarunner and ollamarunner).
-
- 11 Mar, 2025 1 commit
-
-
Daniel Hiltgen authored
-
- 10 Mar, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 07 Mar, 2025 1 commit
-
-
Jesse Gross authored
We sometimes tokenize partial strings. For example, with multimodal inputs, we split the input string around the images and then tokenize each piece. In these cases, we should only add the special tokens on the first piece.
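A small sketch of the rule, using a stand-in tokenizer rather than the real one:

```go
package main

import "fmt"

// tokenize is a fake tokenizer that optionally prepends a special token.
func tokenize(piece string, addSpecial bool) []string {
	toks := []string{}
	if addSpecial {
		toks = append(toks, "<bos>")
	}
	return append(toks, piece) // pretend each piece is a single token
}

func main() {
	// Input split around an image: only the first text piece gets <bos>.
	pieces := []string{"describe ", "[image]", " in detail"}
	var tokens []string
	for i, p := range pieces {
		tokens = append(tokens, tokenize(p, i == 0)...)
	}
	fmt.Println(tokens)
}
```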
-
- 04 Mar, 2025 1 commit
-
-
Daniel Hiltgen authored
* Include unified vision layers in memory prediction: for newer vision models with a single GGUF, include the projection estimates.
* Adjust CLI to handle both styles of vision model metadata.
* Wire up new tokenizers for the new engine: if we're loading the new engine, utilize the new model text processor instead of calling into cgo wrappers for llama.cpp. This also cleans up some tech debt from the older tokenization flow for the C++ server, which was no longer used, and adjusts the grammar handling logic to pass through to the new engine instead of utilizing the cgo schema-to-grammar call.
* Lay the foundation for auto selection of the new engine.
-
- 14 Feb, 2025 2 commits
-
-
Jeffrey Morgan authored
Provides a better approach to #9088 that will attempt to evaluate symlinks (important for macOS, where 'ollama' is often a symlink), but use the result of os.Executable() as a fallback in scenarios where filepath.EvalSymlinks fails due to permission errors or other issues.
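A minimal sketch of this fallback pattern using the standard library (not the exact Ollama code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// executablePath resolves symlinks on the executable path when possible, but
// falls back to the raw os.Executable() result if EvalSymlinks fails, e.g.
// due to permission errors.
func executablePath() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	if resolved, err := filepath.EvalSymlinks(exe); err == nil {
		return resolved, nil
	}
	return exe, nil
}

func main() {
	p, err := executablePath()
	fmt.Println(p, err)
}
```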
-
Jeffrey Morgan authored
In some cases, the directories in the executable path read by filepath.EvalSymlinks are not accessible, resulting in permission errors when running models. It also doesn't work well with long paths on Windows, which likewise results in errors. This change removes the use of filepath.EvalSymlinks on os.Executable() altogether.
-