- 06 Aug, 2025 1 commit
-
-
Daniel Hiltgen authored
-
- 05 Aug, 2025 1 commit
-
-
Michael Yang authored
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
  This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal.
* Unit tests for MXFP4 support
  This exercises various operations and shapes on both CPU and GPU (if detected on the system)
* cuda graph
* unit test adjustments
* cuda: optimize memory access
  Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
* mac: fix crash on old macos versions
  cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend.
* server: Minimum context length for gptoss
  This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset.
* ggml: Multiply by numParallel for gptoss sliding window
  When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account.
* gpt-oss integration includes harmony parser and thinking levels, etc.
* fix sync
* fix tests
* fix lint
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
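As a rough illustration of the MXFP4 format mentioned above, the following sketch decodes a group of 4-bit E2M1 elements that share one E8M0 scale, following the OCP Microscaling description (the spec groups 32 elements per scale). The function name `decodeMXFP4` and the nibble packing order (two codes per byte, low nibble first) are assumptions made for this example; they are not taken from the GGML implementation, whose in-memory layout may differ.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 lists the eight non-negative magnitudes representable by a 4-bit
// E2M1 element (1 sign bit, 2 exponent bits, 1 mantissa bit).
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeMXFP4 expands packed 4-bit codes that share a single E8M0 scale.
// Assumption: two codes per byte, low nibble first; real backends may pack
// elements differently. The reserved scale 0xFF (NaN) is not handled here.
func decodeMXFP4(scale uint8, packed []byte) []float32 {
	// An E8M0 scale encodes the power of two 2^(scale-127).
	s := float32(math.Ldexp(1, int(scale)-127))
	out := make([]float32, 0, 2*len(packed))
	for _, b := range packed {
		for _, code := range [2]byte{b & 0x0F, b >> 4} {
			v := e2m1[code&0x7] * s
			if code&0x8 != 0 { // the top bit of the nibble is the sign
				v = -v
			}
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// Two packed bytes hold four elements; a scale byte of 128 means 2^1.
	fmt.Println(decodeMXFP4(128, []byte{0x2F, 0x91}))
}
```

Because every element in a block reuses the same power-of-two scale, dequantization is a table lookup plus one multiply per element.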
-
- 11 Jul, 2025 2 commits
-
-
Jesse Gross authored
Reporting params.NumGPULayers can be misleading because it is the requested number of layers, not the actual number that is loaded. While they are often the same, there are cases where they might mismatch, such as if the GPU backend is missing.
-
Jesse Gross authored
We're not currently using it, even in cases where we could. Disabling it improves generation performance by 10-30% with multiple GPUs.
-
- 09 Jul, 2025 1 commit
-
-
Jesse Gross authored
We don't get valid UUIDs for AMD GPUs on Windows, so the best option is to use the ordinal IDs. This brings us in line with what we currently do on the Ollama server - the only exception is AMD GPUs on Linux, which fall back to using ordinal IDs. The GGML implementation has no fallback but it doesn't appear to occur for any of the GPUs that we support. It's also possible that there are collisions between ordinal IDs for different libraries - however the only places where we use them are AMD on Windows and Metal on Mac, which can never occur on the same system.
-
- 07 Jul, 2025 1 commit
-
-
Jesse Gross authored
The root cause was an unclean upgrade - this code is fine. This reverts commit 45f216a9.
-
- 02 Jul, 2025 1 commit
-
-
Daniel Hiltgen authored
This adds some extra logs to make the new engine a bit more consistent with the llama engine.
-
- 27 Jun, 2025 1 commit
-
-
Jesse Gross authored
This is causing segfaults, so disable it. Currently UUIDs are only used for debugging purposes, although they are planned to be used in additional ways in the future. Bug #11211
-
- 26 Jun, 2025 1 commit
-
-
Michael Yang authored
* update patches
* cherry pick metal mean kernel
* cherry pick cuda mean kernel
* gemma3n
-
- 20 Jun, 2025 1 commit
-
-
Jesse Gross authored
We don't check the return status after computing the graph, which can silently lead to bad outputs if we try to keep going and future computation succeeds. This appears to happen in certain cases on Apple M2 devices. Fixes #11070
-
- 18 Jun, 2025 2 commits
-
-
Jeffrey Morgan authored
Reverts PR #11115. The original change was mistakenly reverted instead of #10822
-
Jeffrey Morgan authored
This reverts commit aaa78180.
-
- 29 May, 2025 1 commit
-
-
Jesse Gross authored
This enables matching up devices and information reported by the backend with system management libraries such as nvml to get accurate free memory reporting.
-
- 22 May, 2025 2 commits
-
-
Jesse Gross authored
FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics. Empty and Zeros directly panic if they can't allocate memory. This makes things consistent by panicking for the first two cases, removing a fair amount of error handling code. This is also consistent with how Go typically handles these situations.
-
Jesse Gross authored
This provides granular information about the backend memory allocations required by the runner:
- Per backend
- Per layer
- Weights, cache and graph
- Allocation status

This can be used for debugging and validating memory estimates.
-
- 21 May, 2025 1 commit
-
-
Michael Yang authored
* feat: qwen3 dense
* feat: qwen3moe
* fix llama4 moe
-
- 20 May, 2025 1 commit
-
-
Michael Yang authored
-
- 19 May, 2025 1 commit
-
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps:
- Create backend, including enumerating tensors and memory allocation
- Loading tensor data

This allows more flexibility in managing model loading.
-
- 15 May, 2025 1 commit
-
-
Michael Yang authored
* panic if trying to pad 4d
* fix pixel values padding
-
- 14 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 12 May, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Michael Yang authored
reduce prompt log to trace level
-
- 06 May, 2025 1 commit
-
-
Daniel Hiltgen authored
* Move quantization logic to GGML via new backend
  This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
  This is no longer needed now that quantization is implemented in Go+GGML code directly.
-
- 02 May, 2025 1 commit
-
-
Jesse Gross authored
Successfully completing processing with an errgroup cancels the associated context. However, we also have a goroutine that is checking for cancelation of the context. As a result, there is a race where the goroutine can pick up the cancelation and report an error, replacing the successful result. To avoid that, this replaces the goroutine with a cancelation check when we are reading files. This also has the advantage of stopping all reads relatively quickly on error and also ensuring that there are no outstanding I/O operations when we return in this case. The downside is that if a file read blocks forever (for example, over the network) then cancelation of the context effectively won't be honored. However, this is also true for other smaller files we read and the tensors are read in small chunks (128K), so it's consistent and better on balance overall.
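A minimal sketch of the "check for cancelation while reading" idea, using only the standard library's `context` package rather than an errgroup. The names `copyChunks` and `run` are hypothetical, and the 128 KiB chunk size simply mirrors the figure quoted above.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
)

// copyChunks copies r to w in fixed-size chunks and checks ctx before each
// read, so no separate watcher goroutine can race with a successful finish.
func copyChunks(ctx context.Context, w io.Writer, r io.Reader, chunk int) (int64, error) {
	buf := make([]byte, chunk)
	var total int64
	for {
		if err := ctx.Err(); err != nil {
			return total, err // canceled between chunks
		}
		n, err := r.Read(buf)
		if n > 0 {
			if _, werr := w.Write(buf[:n]); werr != nil {
				return total, werr
			}
			total += int64(n)
		}
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// run copies 300 KiB of zeros in 128 KiB chunks, optionally canceling first.
func run(cancelFirst bool) (int64, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	src := bytes.NewReader(make([]byte, 300*1024))
	return copyChunks(ctx, io.Discard, src, 128*1024)
}

func main() {
	fmt.Println(run(false))
	fmt.Println(run(true))
}
```

As the commit message notes, the trade-off of this pattern is that a single read that blocks indefinitely delays the cancelation check until it returns.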
-
- 25 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 18 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 11 Apr, 2025 4 commits
-
-
Jesse Gross authored
For every forward pass through the model, we need to allocate input tensors: tokens, images, positions, outputs and masks. These get allocated in system memory. However, when we close the context that the tensors were allocated through, the metadata gets freed but the actual backend memory does not. This results in a significant memory leak. This makes it so that all the memory allocated through a context gets freed when it is closed. Fixes #10040
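One way to picture "everything allocated through a context is freed when it is closed" is a context that records a release function for each allocation. The `allocContext` type below is a hypothetical illustration and not the backend's actual data structure; the `live` counter stands in for real backend memory.

```go
package main

import "fmt"

// allocContext is a hypothetical stand-in for a backend context that owns
// every buffer allocated through it.
type allocContext struct {
	release []func()
}

// alloc hands out a buffer and records how to release it later.
// The live counter simulates backend memory accounting.
func (c *allocContext) alloc(n int, live *int) []float32 {
	*live += n
	c.release = append(c.release, func() { *live -= n })
	return make([]float32, n)
}

// Close releases every allocation made through the context, newest first.
func (c *allocContext) Close() {
	for i := len(c.release) - 1; i >= 0; i-- {
		c.release[i]()
	}
	c.release = nil
}

func main() {
	live := 0
	ctx := &allocContext{}
	ctx.alloc(512, &live) // e.g. a token input
	ctx.alloc(64, &live)  // e.g. a position input
	fmt.Println("live before close:", live)
	ctx.Close()
	fmt.Println("live after close:", live)
}
```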
-
Jesse Gross authored
Allocating (and in particular, freeing) memory from CUDA host buffers is expensive and can cause a significant performance hit if we do it for every token. Using normal system memory avoids this issue and also gives the OS more flexibility to manage it. There is no performance impact from this patch directly (either positive or negative) but it makes a difference once we start freeing memory correctly.
-
Jesse Gross authored
Context is currently mixed between pointer and value receivers. Change this to be all pointer receivers so we don't have to reason about whether the things we are updating in the struct will be retained.
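The hazard with mixed receivers is easy to reproduce in a few lines. The `counter` type here is a made-up example, unrelated to the real Context struct: an update made through a value receiver is applied to a copy and lost, while the same update through a pointer receiver is retained.

```go
package main

import "fmt"

type counter struct{ n int }

// incByValue operates on a copy of the struct, so the caller never sees
// the increment.
func (c counter) incByValue() { c.n++ }

// incByPointer operates on the caller's struct, so the increment sticks.
func (c *counter) incByPointer() { c.n++ }

func main() {
	var c counter
	c.incByValue()
	fmt.Println("after value receiver:", c.n)
	c.incByPointer()
	fmt.Println("after pointer receiver:", c.n)
}
```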
-
Jesse Gross authored
Sometimes loading the GGUF file fails with: panic: context canceled This is probably a filesystem error but it doesn't provide any information about what happened.
-
- 08 Apr, 2025 2 commits
-
-
Jesse Gross authored
Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations - Ollama will crash in the middle of inference. If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future. Currently, this only generates a worst case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
-
Jesse Gross authored
If there is a CUDA OOM, we currently don't check the return value and will eventually segfault. This checks for the problem and generates a Go error. At the moment, this will still result in a panic but having the error is the first step to being able to handle it more gracefully.
-
- 05 Apr, 2025 1 commit
-
-
Daniel Hipke authored
improves model loading times on network-based filesystems such as GCS fuse by creating a dedicated file descriptor for each section of the file being read, reducing seeking
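The approach described above might look roughly like the following sketch, in which each section of a file is read through its own descriptor with `io.NewSectionReader`. The helpers `readSections` and `writeTemp` are hypothetical names for this example, and splitting the file into equal-sized parts is an assumption; the real loader presumably divides the file along tensor boundaries.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

// readSections reads the file at path in up to `parts` contiguous sections,
// opening a dedicated file descriptor for each so no reader has to seek on
// a shared handle. parts must be positive.
func readSections(path string, parts int) ([]byte, error) {
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	size := info.Size()
	out := make([]byte, size)
	errs := make([]error, parts)
	per := (size + int64(parts) - 1) / int64(parts) // section length, rounded up
	var wg sync.WaitGroup
	for i := 0; i < parts; i++ {
		off := int64(i) * per
		if off >= size {
			break
		}
		n := per
		if size-off < n {
			n = size - off
		}
		wg.Add(1)
		go func(i int, off, n int64) {
			defer wg.Done()
			f, err := os.Open(path) // one descriptor per section
			if err != nil {
				errs[i] = err
				return
			}
			defer f.Close()
			_, errs[i] = io.ReadFull(io.NewSectionReader(f, off, n), out[off:off+n])
		}(i, off, n)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

// writeTemp writes data to a new temporary file and returns its path.
func writeTemp(data []byte) (string, error) {
	f, err := os.CreateTemp("", "sections-*")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.Write(data)
	return f.Name(), err
}

func main() {
	path, err := writeTemp([]byte("0123456789abcdefghij"))
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)
	data, err := readSections(path, 3)
	fmt.Println(string(data), err)
}
```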
-
- 03 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open source models. This updates the forward pass of llama architecture models to support both llama models and mistral models by accounting for additional metadata present in mistral models, and finding the correct dimensions for the output projection.
-
Michael Yang authored
-
- 27 Mar, 2025 1 commit
-
-
Jesse Gross authored
Model implementations should use Input for all of their tensors supplied to the model. This includes tensors that relate to the outputs, which is confusing since there is also an Output function. Since Output is only used internally in GGML and not used by any model implementations, we can remove it from the interface to reduce confusion.
-
- 21 Mar, 2025 1 commit
-
-
Michael Yang authored
-
- 18 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
When converting a ggml model, if there is a failure to read tensor data, a nil error value was being returned. It should be assigned to the actual error from reading.
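The corrected pattern, in general terms, is to return the error produced by the read itself rather than an unrelated (and possibly nil) variable. The sketch below uses hypothetical helpers `readTensorData` and `readFrom`; it illustrates the principle and does not reproduce the code touched by this commit.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readTensorData reads exactly size bytes and propagates the read error
// itself, wrapped with context, instead of returning nil on failure.
func readTensorData(r io.Reader, size int) ([]byte, error) {
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, fmt.Errorf("reading %d bytes of tensor data: %w", size, err)
	}
	return buf, nil
}

// readFrom is a small helper that feeds an in-memory source to readTensorData.
func readFrom(src []byte, size int) ([]byte, error) {
	return readTensorData(bytes.NewReader(src), size)
}

func main() {
	// The source is shorter than requested, so the read error is reported.
	_, err := readFrom([]byte{1, 2, 3}, 8)
	fmt.Println(err)
}
```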
-
- 17 Mar, 2025 1 commit
-
-
Jeffrey Morgan authored
-