- 04 Aug, 2025 15 commits
-
-
Jesse Gross authored
When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account.
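A minimal sketch of the idea, assuming illustrative names (effectiveKVSize, requestedCtx, slidingWindow, and numParallel are not the actual identifiers):

```go
// Minimal sketch, not Ollama's actual code: names are illustrative.
package main

import "fmt"

// effectiveKVSize returns the context length a KV cache estimate should use.
// The requested context has already been multiplied by numParallel by the
// caller, but a sliding-window model caps each sequence at its window size,
// so the window has to be scaled by numParallel explicitly.
func effectiveKVSize(requestedCtx, slidingWindow, numParallel int) int {
	if slidingWindow > 0 && slidingWindow*numParallel < requestedCtx {
		return slidingWindow * numParallel
	}
	return requestedCtx
}

func main() {
	// 4 parallel sequences, 32k total requested context, 4k sliding window.
	fmt.Println(effectiveKVSize(32768, 4096, 4)) // 16384, not 4096
}
```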
-
Jesse Gross authored
This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset.
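A minimal sketch of the clamping behavior, with an assumed minContextLength constant standing in for the real configuration path:

```go
// Minimal sketch with an assumed minContextLength; lower requests are
// silently raised, higher requests pass through unchanged.
package main

import "fmt"

const minContextLength = 8192

func clampContext(requested int) int {
	if requested < minContextLength {
		return minContextLength // silently reset to the minimum
	}
	return requested
}

func main() {
	fmt.Println(clampContext(2048))  // 8192
	fmt.Println(clampContext(16384)) // 16384
}
```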
-
Daniel Hiltgen authored
cblas_sgemm is only supported on v13.3 and up; however, bf16 is only supported on v14+, so we were falling back to ggml-blas and crashing on bf16 tensors. Checking whether the function is null seems to be the simplest way to conditionally avoid registering the backend.
-
Daniel Hiltgen authored
Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
-
Daniel Hiltgen authored
-
Michael Yang authored
-
Daniel Hiltgen authored
This exercises various operations and shapes on both CPU and GPU (if detected on the system)
-
Daniel Hiltgen authored
This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal.
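For orientation, a rough sketch of dequantizing a single MX FP4 block (32 E2M1 elements sharing one E8M0 power-of-two scale, per the OCP Microscaling spec); the packing order and names here are assumptions for illustration, not the actual CPU, CUDA, or Metal kernels:

```go
// Rough sketch of dequantizing one MXFP4 block; packing order assumed.
package main

import (
	"fmt"
	"math"
)

// e2m1 maps a 4-bit code (1 sign bit + 3 magnitude bits) to its value.
var e2m1 = [16]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6,
	0, -0.5, -1, -1.5, -2, -3, -4, -6}

// dequantMXFP4 expands one block: a single E8M0 scale byte plus 16 bytes
// packing 32 4-bit elements (two per byte, low nibble first is assumed here).
func dequantMXFP4(scale byte, packed [16]byte) [32]float32 {
	s := float32(math.Exp2(float64(int(scale) - 127))) // E8M0 scale: 2^(e-127)
	var out [32]float32
	for i, b := range packed {
		out[2*i] = s * e2m1[b&0x0F]
		out[2*i+1] = s * e2m1[b>>4]
	}
	return out
}

func main() {
	var packed [16]byte
	packed[0] = 0x21                 // low nibble code 1 (0.5), high nibble code 2 (1.0)
	out := dequantMXFP4(127, packed) // scale byte 127 -> 2^0 = 1
	fmt.Println(out[0], out[1])      // 0.5 1
}
```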
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
- 31 Jul, 2025 3 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Jesse Gross authored
Models that use sliding window attention can only resume a sequence from the cache if it falls within the saved windows. This works well if the next message picks up where the old one left off. However, it generally prevents a partial prefix match unless the entire conversation falls within the sliding window. This can be a problem with reasoning models, where the traces are supposed to be removed from future messages, forcing the entire history to be re-evaluated. This change allows models to specify that a larger amount of the history be retained in memory, enabling partial resumption in more cases. It still respects the window that the model was trained on for token generation.
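A simplified model of the resumption check; the names and arithmetic are illustrative, not the actual cache code:

```go
// Simplified model of the check; names and arithmetic are illustrative.
package main

import "fmt"

// canResume reports whether a new prompt sharing prefixLen tokens with a
// cached sequence of cachedLen tokens can reuse the cache, given that only
// the last keptWindow positions are still present and that generation
// attends over trainedWindow positions.
func canResume(prefixLen, cachedLen, trainedWindow, keptWindow int) bool {
	oldestKept := cachedLen - keptWindow
	if oldestKept < 0 {
		oldestKept = 0
	}
	needFrom := prefixLen - trainedWindow // oldest position the resumed tokens attend to
	if needFrom < 0 {
		needFrom = 0
	}
	return needFrom >= oldestKept
}

func main() {
	// Trained window 1024, 8192 cached tokens, 4096-token shared prefix.
	fmt.Println(canResume(4096, 8192, 1024, 1024)) // false: full re-evaluation
	fmt.Println(canResume(4096, 8192, 1024, 8192)) // true: enough history kept
}
```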
-
- 30 Jul, 2025 3 commits
-
-
Sajal Kulshreshtha authored
-
Daniel Hiltgen authored
This reverts commit 9d071e6089319b37acf62bb739e3430dcb2ac0c3.
-
Daniel Hiltgen authored
Support for bf16 was added in macOS v14+, and attempting to enable it on older versions causes runtime failures.
-
- 29 Jul, 2025 3 commits
-
-
Daniel Hiltgen authored
-
Oliver Simons authored
* Enable CUDA Graphs for gemma3n.

  Similar to https://github.com/ggml-org/llama.cpp/pull/14741, though ollama has a slightly different model graph than llama.cpp which requires different workaround checks.

* Remove residual check by reshaping differently in gemma3n model

  This should make the heuristics more robust
-
Jesse Gross authored
When we context shift, we delete half the context and apply RoPE with an offset to the other half. We used to RoPE across the entire context in a single pass with a zero offset for the deleted section. With the change to shifting in batches, we can skip any batches where all of the offsets would be zero. This typically reduces the number of operations by half.
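A minimal sketch of the skip condition, using hypothetical names:

```go
// Minimal sketch (hypothetical names): a shift chunk whose offsets are all
// zero, i.e. positions belonging to the deleted half, needs no RoPE call.
package main

import "fmt"

func allZero(offsets []int) bool {
	for _, o := range offsets {
		if o != 0 {
			return false
		}
	}
	return true
}

func main() {
	deleted := make([]int, 512)           // deleted half: offsets stay zero, skip the chunk
	kept := []int{-512, -512, -512, -512} // kept half: shifted back, must be rotated
	fmt.Println(allZero(deleted), allZero(kept)) // true false
}
```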
-
- 28 Jul, 2025 1 commit
-
-
Yoshi authored
-
- 27 Jul, 2025 1 commit
-
-
Mayan EDMS authored
-
- 25 Jul, 2025 2 commits
-
-
Jesse Gross authored
Currently, when we need to do a shift on the cache, it is one RoPE operation on the entire size of the cache (per layer). In some cases, this can create a compute graph that is larger than the forward pass, since the forward pass is working in batches. Since we don't consider shifting in our memory estimates, it's possible for this to cause a crash if we run out of memory.

By limiting the size of the RoPE calls to batch-size chunks, we ensure that the shift will never exceed the size of the forward pass, since the forward pass will also contain a RoPE of the same size. This does not have a significant impact on performance, since RoPE is a math operation that is mostly proportional to the size of its inputs.

In theory, defrag could have the same issue since it also creates a compute graph outside of the forward pass; however, since it only performs copies, it does not require any working space.
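A minimal sketch of shifting in batch-size chunks, with hypothetical helper names standing in for the real graph operations:

```go
// Minimal sketch with hypothetical helpers: issue the shift as batch-size
// RoPE chunks so no single call exceeds the size of a forward pass.
package main

import "fmt"

const batchSize = 512

// applyRoPEChunk stands in for the real per-chunk RoPE graph operation.
func applyRoPEChunk(offsets []int) {
	fmt.Printf("RoPE over %d positions\n", len(offsets))
}

func shiftCache(offsets []int) {
	for start := 0; start < len(offsets); start += batchSize {
		end := start + batchSize
		if end > len(offsets) {
			end = len(offsets)
		}
		applyRoPEChunk(offsets[start:end])
	}
}

func main() {
	offsets := make([]int, 2048) // a 2048-entry cache becomes four 512-position RoPE calls
	shiftCache(offsets)
}
```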
-
Ruyut authored
-
- 24 Jul, 2025 2 commits
-
-
Patrick Devine authored
-
Jeffrey Morgan authored
-
- 23 Jul, 2025 2 commits
-
-
minxinyi authored
-
Michael Yang authored
-
- 22 Jul, 2025 2 commits
-
-
Patrick Devine authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
ycomiti authored
-
- 20 Jul, 2025 2 commits
-
-
Stefan Wärting authored
-
Jeffrey Morgan authored
Co-authored-by: frob <rick+github@frob.com.au>
-
- 19 Jul, 2025 1 commit
-
-
zmldndx authored
-
- 17 Jul, 2025 3 commits
-
-
Daniel Hiltgen authored
The macos-13 runner is x86, while macos-13-xlarge is arm64.
-
frob authored
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-