- 11 Aug, 2025 3 commits
-
-
Devon Rifkin authored
server: fix error when parsing bad harmony tool calls
-
Devon Rifkin authored
Thanks @moll for reporting! Fixes: #11781
-
Daniel Andersen authored
This patch modifies Ollama to group GPUs so that the requested model is memory-fit to a subset of them, instead of the former algorithm of using a single GPU or distributing across all available GPUs. Benefits:
- Less (PCIe) bus communication between GPUs, especially when the links are not very fast
- Unallocated GPUs can drop into power-saving mode
- Significantly reduced VRAM allocation when using more than 2 GPUs in a system
- Because of the reduced memory allocation, more models can run simultaneously
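A rough Go sketch of the grouping idea, using hypothetical `gpuInfo`/`pickGPUGroup` names rather than Ollama's actual scheduler types: pick the smallest set of GPUs whose combined free VRAM fits the model, so the remaining GPUs stay untouched.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuInfo is a hypothetical stand-in for the scheduler's per-GPU state.
type gpuInfo struct {
	ID       int
	FreeVRAM uint64 // bytes
}

// pickGPUGroup returns the smallest group of GPUs whose combined free VRAM
// can hold the model, preferring GPUs with the most free memory so the
// remaining GPUs stay idle and can enter power-saving mode.
func pickGPUGroup(gpus []gpuInfo, modelBytes uint64) []gpuInfo {
	sorted := append([]gpuInfo(nil), gpus...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].FreeVRAM > sorted[j].FreeVRAM })

	var group []gpuInfo
	var total uint64
	for _, g := range sorted {
		group = append(group, g)
		total += g.FreeVRAM
		if total >= modelBytes {
			return group
		}
	}
	return nil // the model does not fit even across all GPUs
}

func main() {
	gpus := []gpuInfo{{ID: 0, FreeVRAM: 24 << 30}, {ID: 1, FreeVRAM: 24 << 30}, {ID: 2, FreeVRAM: 12 << 30}}
	fmt.Println(pickGPUGroup(gpus, 30<<30)) // two GPUs suffice; the third stays idle
}
```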
-
- 10 Aug, 2025 1 commit
-
-
Michael Vorburger authored
-
- 08 Aug, 2025 3 commits
-
-
Jesse Gross authored
Callers can set a backend buffer type to be no-alloc, meaning that it does not allocate memory for tensors or operations. This can be used for calculating memory requirements. Tensors and graphs must be recreated with no-alloc set to false before loading data. Defaults to false for newly created backend buffer types.
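A simplified sketch of how a no-alloc buffer type can be used to measure requirements before committing real memory; `bufferType` and its `Alloc` method are illustrative names, not the actual backend API.

```go
package main

import "fmt"

// bufferType is an illustrative stand-in for a backend buffer type.
// With NoAlloc set, Alloc only records how much memory would be needed.
type bufferType struct {
	NoAlloc  bool   // defaults to false for newly created buffer types
	reserved uint64 // bytes requested so far
	buffers  [][]byte
}

func (bt *bufferType) Alloc(n uint64) {
	bt.reserved += n
	if bt.NoAlloc {
		return // measurement pass: skip the real allocation
	}
	bt.buffers = append(bt.buffers, make([]byte, n))
}

func main() {
	// First pass: compute memory requirements without allocating.
	measure := &bufferType{NoAlloc: true}
	for _, tensorSize := range []uint64{1 << 20, 4 << 20, 16 << 20} {
		measure.Alloc(tensorSize)
	}
	fmt.Printf("graph would need %d bytes\n", measure.reserved)

	// Second pass: recreate with NoAlloc=false before loading data.
	load := &bufferType{}
	load.Alloc(measure.reserved)
	fmt.Printf("allocated %d bytes\n", len(load.buffers[0]))
}
```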
-
Jesse Gross authored
In order to iteratively find the best memory allocation, we need to be able to free backend memory so we can try again.
-
Jesse Gross authored
For many backend data structures, GGML defines a typedef of a pointer type and returns these from functions. In most cases, CGo understands that these are interchangeable, but some parts of Go (such as generics) think they are two different types. We should prefer the form that GGML uses.
-
- 07 Aug, 2025 6 commits
-
-
Daniel Hiltgen authored
Also wires up support to override the default "smol" model
-
Jesse Gross authored
gpt-oss works best with a context length of at least 8k. However, for GPUs with a limited amount of VRAM, there is a significant performance hit from this increased context. In these cases, we switch to the Ollama default of 4k.
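A minimal sketch of that heuristic, with made-up names and a crude VRAM check standing in for the real memory estimates:

```go
package main

import "fmt"

const (
	defaultCtx = 4096 // the Ollama default
	gptossCtx  = 8192 // gpt-oss works best with at least 8k of context
)

// pickContextLength returns the larger gpt-oss context only when the GPU has
// enough free VRAM for the model plus the bigger KV cache; otherwise it falls
// back to the default. All inputs are hypothetical byte counts.
func pickContextLength(freeVRAM, modelBytes, kvBytesPerToken uint64) int {
	need := modelBytes + gptossCtx*kvBytesPerToken
	if freeVRAM >= need {
		return gptossCtx
	}
	return defaultCtx
}

func main() {
	fmt.Println(pickContextLength(16<<30, 13<<30, 64<<10)) // plenty of headroom: 8192
	fmt.Println(pickContextLength(8<<30, 7<<30, 256<<10))  // limited VRAM: 4096
}
```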
-
Devon Rifkin authored
openai: always provide reasoning
-
Devon Rifkin authored
We were not passing along thinking if content was nil (as opposed to an empty string). Also added a test for content not being passed, which was the real cause of https://github.com/ollama/ollama/issues/11704, since, given the way `Content` is typed, not passing it and passing an empty string are distinct.
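A small sketch of why the distinction matters, assuming a `Content` field typed as `any` so an omitted field decodes to nil rather than an empty string (the struct is illustrative, not the real compat-layer type):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message is an illustrative stand-in for an OpenAI-compat chat message.
// Content is typed as any so that an omitted field (nil) and an explicit
// empty string ("") remain distinguishable after JSON decoding.
type message struct {
	Role     string `json:"role"`
	Content  any    `json:"content"`
	Thinking string `json:"thinking,omitempty"`
}

func main() {
	for _, raw := range []string{
		`{"role":"assistant","thinking":"...","content":""}`, // empty string content
		`{"role":"assistant","thinking":"..."}`,              // content not passed
	} {
		var m message
		_ = json.Unmarshal([]byte(raw), &m)

		// Thinking must be forwarded in both cases, not only when Content
		// is a non-nil string.
		fmt.Printf("content=%#v thinking=%q\n", m.Content, m.Thinking)
	}
}
```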
-
Devon Rifkin authored
openai: when converting role=tool messages, propagate the tool name
-
Devon Rifkin authored
Added support for converting both `name` and `tool_call_id` fields, which different clients might provide. `name` is a legacy field from the OpenAI completions API. For `tool_call_id`, we inspect previous messages, look for a matching tool call ID, and grab its name. Issue: https://github.com/ollama/ollama/issues/11704
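A rough sketch of the lookup described above, with illustrative message shapes rather than the actual openai package types: prefer the legacy `name` field, otherwise scan earlier tool calls for a matching ID.

```go
package main

import "fmt"

// Minimal, illustrative message shapes; not the actual compat-layer types.
type toolCall struct {
	ID   string
	Name string
}

type chatMessage struct {
	Role       string
	Name       string // legacy OpenAI completions-style field
	ToolCallID string
	ToolCalls  []toolCall
}

// toolNameFor resolves the tool name for a role=tool message: prefer the
// legacy name field, otherwise look back for the assistant tool call whose
// ID matches tool_call_id and reuse its function name.
func toolNameFor(history []chatMessage, msg chatMessage) string {
	if msg.Name != "" {
		return msg.Name
	}
	for _, prev := range history {
		for _, tc := range prev.ToolCalls {
			if tc.ID == msg.ToolCallID {
				return tc.Name
			}
		}
	}
	return ""
}

func main() {
	history := []chatMessage{
		{Role: "assistant", ToolCalls: []toolCall{{ID: "call_1", Name: "get_weather"}}},
	}
	toolMsg := chatMessage{Role: "tool", ToolCallID: "call_1"}
	fmt.Println(toolNameFor(history, toolMsg)) // get_weather
}
```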
-
- 06 Aug, 2025 7 commits
-
-
Patrick Devine authored
-
Devon Rifkin authored
openai: allow for content _and_ tool calls in the same message
-
Devon Rifkin authored
Previously our OpenAI chat completions compat layer assumed that tool calls and content would never be provided together, but this is not a correct assumption. Content is only optional when tool calls are present, but tool calls and content can be provided together. Fixes: https://github.com/ollama/ollama/issues/11704
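A tiny sketch of the corrected rule, using an illustrative message shape: content may be omitted only when tool calls are present, and the two may also appear together.

```go
package main

import (
	"errors"
	"fmt"
)

// assistantMessage is an illustrative shape for an OpenAI-compat assistant turn.
type assistantMessage struct {
	Content   string
	ToolCalls []string
}

// validate encodes the corrected rule: content is optional only when tool
// calls are present, and content and tool calls may be provided together.
func validate(m assistantMessage) error {
	if m.Content == "" && len(m.ToolCalls) == 0 {
		return errors.New("message must include content or tool calls")
	}
	return nil
}

func main() {
	both := assistantMessage{Content: "Looking that up.", ToolCalls: []string{"get_weather"}}
	fmt.Println(validate(both)) // <nil>: content and tool calls together is valid
}
```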
-
Daniel Hiltgen authored
-
Gao feng authored
update api.md to make it consistent with the code. https://github.com/ollama/ollama/blob/main/server/download.go#L447
-
Parth Sareen authored
-
Devon Rifkin authored
tools: support anyOf types
-
- 05 Aug, 2025 6 commits
-
-
Devon Rifkin authored
afaik gpt-oss is the first model that meaningfully transforms tool function definitions in its template. We found that relatively common definitions that include `anyOf` were not working because the template was assuming that types were always defined via a `type` field. anyOf allows for fully recursive types, so I exposed a `toTypeScriptType()` function to handle this recursive logic in Go and keep the templates cleaner. The gpt-oss templates will need to be updated to use this. We should keep building out our function definition support to more fully cover the parts of JSON Schema that make sense for this use case, but in the meantime this will unblock some users (e.g., Zed's Ollama integration w/ gpt-oss). Probably the most urgent gap is proper array support.
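A hedged sketch of the recursive idea behind a `toTypeScriptType()`-style helper (not the actual Ollama implementation); it handles `type`, `anyOf`, and arrays for illustration:

```go
package main

import "fmt"

// schema is a minimal JSON Schema fragment for illustration.
type schema struct {
	Type  string
	Items *schema
	AnyOf []schema
}

// toTypeScriptType recursively renders a schema as a TypeScript-style type
// string; anyOf becomes a union, which is why a flat "type" lookup is not enough.
func toTypeScriptType(s schema) string {
	if len(s.AnyOf) > 0 {
		out := ""
		for i, sub := range s.AnyOf {
			if i > 0 {
				out += " | "
			}
			out += toTypeScriptType(sub)
		}
		return out
	}
	switch s.Type {
	case "integer", "number":
		return "number"
	case "array":
		if s.Items != nil {
			return toTypeScriptType(*s.Items) + "[]"
		}
		return "any[]"
	case "":
		return "any"
	default:
		return s.Type // string, boolean, object, ...
	}
}

func main() {
	// {"anyOf": [{"type": "string"}, {"type": "array", "items": {"type": "integer"}}]}
	s := schema{AnyOf: []schema{{Type: "string"}, {Type: "array", Items: &schema{Type: "integer"}}}}
	fmt.Println(toTypeScriptType(s)) // string | number[]
}
```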
-
Daniel Hiltgen authored
This should help reduce the runtime dependencies on windows.
-
Michael Yang authored
-
Jeffrey Morgan authored
-
Jesse Gross authored
KV cache quantization has a dependency on the flash attention kernel. We currently cannot use flash attention with gpt-oss as it requires additional operations. The model definition does not call flash attention, so it works regardless of the setting, but the cache will pick up the quantization type. This updates the flash attention setting earlier in the loading flow so that all downstream settings are also set correctly. Fixes: #11671
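A simplified sketch of the ordering constraint, with hypothetical option names: decide on flash attention first, then derive the KV cache type from that decision.

```go
package main

import "fmt"

// loadOptions is an illustrative subset of the settings resolved at load time.
type loadOptions struct {
	FlashAttention bool
	KVCacheType    string // "f16", "q8_0", ...
}

// resolve applies the dependency described above: KV cache quantization
// requires the flash attention kernel, so the flash attention decision is
// made first and the cache type is derived from it.
func resolve(requestedCacheType string, modelSupportsFlashAttn bool) loadOptions {
	opts := loadOptions{FlashAttention: modelSupportsFlashAttn}
	if !opts.FlashAttention {
		// Without flash attention the cache falls back to f16 instead of
		// picking up an unsupported quantization type.
		opts.KVCacheType = "f16"
		return opts
	}
	opts.KVCacheType = requestedCacheType
	return opts
}

func main() {
	fmt.Println(resolve("q8_0", false)) // no flash attention: cache stays f16
	fmt.Println(resolve("q8_0", true))  // flash attention available: quantized cache allowed
}
```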
-
Michael Yang authored
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
  This implements the Open Compute Microscaling (MX) FP4 format as a tensor type, with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal.
* Unit tests for MXFP4 support
  This exercises various operations and shapes on both CPU and GPU (if detected on the system).
* cuda graph
* unit test adjustments
* cuda: optimize memory access
  Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4.
* mac: fix crash on old macos versions
  cblas_sgemm is only supported on v13.3 and up; however, bf16 is only supported on v14+, so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend.
* server: Minimum context length for gptoss (see the sketch below)
  This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms, but lower values will be silently reset.
* ggml: Multiply by numParallel for gptoss sliding window
  When computing the graph size estimate, the context size is already multiplied by numParallel, so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account.
* gpt-oss integration includes harmony parser and thinking levels, etc.
* fix sync
* fix tests
* fix lint
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
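As flagged in the minimum-context bullet above, here is a minimal sketch of the "silently reset" behavior, assuming a hypothetical `clampContext` helper rather than the actual server option handling:

```go
package main

import "fmt"

const gptossMinContext = 8192

// clampContext mirrors the behavior described above for gpt-oss: values
// above the minimum pass through, values below it are silently raised.
func clampContext(requested int) int {
	if requested < gptossMinContext {
		return gptossMinContext
	}
	return requested
}

func main() {
	fmt.Println(clampContext(4096))  // silently reset to 8192
	fmt.Println(clampContext(16384)) // user-provided larger value is kept
}
```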
-
- 04 Aug, 2025 1 commit
-
-
Jesse Gross authored
There is a bug when using sliding window attention where we run out of KV cache slots. This is likely due to not correctly removing all of the entries as they slide out of range. This adds additional logging when this occurs to track down the source. Bug #10127
-
- 31 Jul, 2025 1 commit
-
-
Jesse Gross authored
Models that use sliding window attention can only resume a sequence from the cache if it falls within the saved windows. This works well if the next message picks up where the old one left off. However, it generally prevents a partial prefix match unless the entire conversation falls within the sliding window. This can be a problem with reasoning models where the traces are supposed to be removed from future messages, forcing the entire history to be re-evaluated. This change allows models to specify that a larger amount of the history be retained in memory, to allow more partial resumption. It still respects the window that the model was trained on for token generation.
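A rough sketch of the idea, with hypothetical names: the cache keeps a retention window that may be larger than the trained sliding window, and a cached sequence can be resumed only if the shared prefix still falls inside the retained history.

```go
package main

import "fmt"

// cacheConfig pairs the trained sliding window used for attention with a
// (possibly larger) retention window controlling how much history stays in
// the cache for prefix resumption.
type cacheConfig struct {
	SlidingWindow int // what the model was trained on
	Retention     int // how much history to keep in memory
}

// canResume reports whether a new request that shares prefixLen tokens with
// a cached sequence of cachedLen tokens can reuse the cache, i.e. the shared
// prefix still falls inside the retained history.
func canResume(c cacheConfig, cachedLen, prefixLen int) bool {
	keep := c.SlidingWindow
	if c.Retention > keep {
		keep = c.Retention
	}
	oldestRetained := cachedLen - keep
	if oldestRetained < 0 {
		oldestRetained = 0
	}
	return prefixLen > oldestRetained
}

func main() {
	trainedOnly := cacheConfig{SlidingWindow: 1024}
	extended := cacheConfig{SlidingWindow: 1024, Retention: 8192}

	// A reasoning model re-sent without its traces shares only an early prefix.
	fmt.Println(canResume(trainedOnly, 6000, 2000)) // false: the prefix slid out of the window
	fmt.Println(canResume(extended, 6000, 2000))    // true: larger retention allows partial resumption
}
```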
-
- 30 Jul, 2025 3 commits
-
-
Sajal Kulshreshtha authored
-
Daniel Hiltgen authored
This reverts commit 9d071e6089319b37acf62bb739e3430dcb2ac0c3.
-
Daniel Hiltgen authored
Support for bf16 was added in macOS v14+; attempting to enable it on older versions causes runtime failures.
-
- 29 Jul, 2025 3 commits
-
-
Daniel Hiltgen authored
-
Oliver Simons authored
* Enable CUDA Graphs for gemma3n.
  Similar to https://github.com/ggml-org/llama.cpp/pull/14741, though ollama has a slightly different model graph than llama.cpp, which requires different workaround checks.
* Remove residual check by reshaping differently in gemma3n model.
  This should make the heuristics more robust.
-
Jesse Gross authored
When we context shift, we delete half the context and apply RoPE with an offset to the other half. We used to RoPE across the entire context in a single pass with a zero offset for the deleted section. With the change to shifting in batches, we can skip any batches where all of the offsets would be zero. This typically reduces the number of operations by half.
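A minimal sketch of the batch-wise shift, with made-up helper names: walk the per-position offsets in batch-sized chunks (as in the earlier shift-batching change) and skip any chunk whose offsets are all zero.

```go
package main

import "fmt"

// applyRoPEShift is a stand-in for the per-batch RoPE call; here it just
// reports that work was done.
func applyRoPEShift(offsets []int32) {
	fmt.Printf("RoPE shift over %d positions\n", len(offsets))
}

// shiftInBatches walks the per-position offsets in batch-sized chunks and
// skips any chunk where every offset is zero (e.g. the deleted half of the
// context), roughly halving the number of shift operations.
func shiftInBatches(offsets []int32, batchSize int) {
	for start := 0; start < len(offsets); start += batchSize {
		end := start + batchSize
		if end > len(offsets) {
			end = len(offsets)
		}
		chunk := offsets[start:end]

		allZero := true
		for _, o := range chunk {
			if o != 0 {
				allZero = false
				break
			}
		}
		if allZero {
			continue // nothing to rotate in this batch
		}
		applyRoPEShift(chunk)
	}
}

func main() {
	// First half deleted (offset 0), second half shifted back by 4 positions.
	offsets := make([]int32, 8)
	for i := 4; i < 8; i++ {
		offsets[i] = -4
	}
	shiftInBatches(offsets, 4) // only the second batch triggers a RoPE call
}
```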
-
- 28 Jul, 2025 1 commit
-
-
Yoshi authored
-
- 27 Jul, 2025 1 commit
-
-
Mayan EDMS authored
-
- 25 Jul, 2025 2 commits
-
-
Jesse Gross authored
Currently, when we need to do a shift on the cache, it is one RoPE operation on the entire size of the cache (per layer). In some cases, this can create a compute graph that is larger than the forward pass since the forward pass is working in batches. Since we don't consider shifting in our memory estimates, it's possible for this to cause a crash if we run out of memory. By limiting the size of the RoPE calls to batch-size chunks, we ensure that the shift will never exceed the size of the forward pass, since the forward pass will also contain a RoPE of the same size. This does not have a significant impact on performance since RoPE is a math operation that is mostly proportional to the size of its inputs. In theory defrag could have the same issue since it also creates a compute graph outside of the forward pass; however, since it only performs copies, it does not require any working space.
-
Ruyut authored
-
- 24 Jul, 2025 2 commits
-
-
Patrick Devine authored
-
Jeffrey Morgan authored
-