1. 09 Jan, 2026 1 commit
    • Add experimental MLX backend and engine with imagegen support (#13648) · 33ee7168
      Daniel Hiltgen authored
      
      
      * WIP - MLX backend with gemma3
      
      * MLX: add cmake and go tag build toggles
      
      To build the new MLX backend code:
        cmake --preset MLX
        cmake --build --preset MLX --parallel
        cmake --install build --component MLX
        go build -tags mlx .
      
      Note: the main.go entrypoint for the MLX engine will change in a follow up commit.
      
      * add experimental image generation runtime
      
      * add experimental image generation runtime
      
      * MLX: wire up cuda build for linux
      
      * MLX: get dependencies correct and dedup
      
      This is still too large for a unified github artifact, but is now "correct" for the mlx_cuda_v13
      directory.
      
      * fix relative link bug in dedup
      
      * Add darwin build and readme
      
      * add go build tag for mlx dependent code and wire up build_darwin.sh
      
      * lint cleanup
      
      * macos: build mlx for x86
      
      This will be CPU only.
      
      * cuda build instructions and fix drift from mlx bump
      
      * stale comment
      
      * Delete agent helper doc
      
      * Clean up readme.md
      
      * Revise README for tokenizer clarity and details
      
      Updated README to clarify tokenizer functionality and removed correctness section.
      
      ---------
      Co-authored-by: jmorganca <jmorganca@gmail.com>
  2. 16 Dec, 2025 1 commit
  3. 15 Dec, 2025 1 commit
  4. 12 Dec, 2025 1 commit
    • flash attn: add auto mode for llama engine (#13052) · bd6c1d6b
      Daniel Hiltgen authored
      * flash attn: add auto mode for llama engine
      
      If the user does not specify fa in the environment, use auto-mode.
      
      * review comments
      
      * ensure kv cache quantized types have FA explicitly enabled
      
      additional review comments
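The decision order described above (quantized KV cache forces flash attention on, an explicit setting is otherwise honored, auto mode falls back to model support) can be sketched roughly as follows. The function name and rule details are illustrative assumptions, not Ollama's actual implementation:

```go
package main

import "fmt"

// resolveFlashAttn is a hypothetical sketch of the auto-mode decision:
// a quantized KV cache type requires flash attention, an explicit user
// setting is otherwise honored, and auto mode enables flash attention
// whenever the model supports it.
func resolveFlashAttn(userSet, userValue bool, kvCacheType string, modelSupportsFA bool) bool {
	if kvCacheType == "q8_0" || kvCacheType == "q4_0" {
		return true // quantized KV cache requires the flash attention kernel
	}
	if userSet {
		return userValue // explicit environment setting wins
	}
	return modelSupportsFA // auto mode
}

func main() {
	fmt.Println(resolveFlashAttn(false, false, "q8_0", false)) // true: quantized cache forces FA
}
```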
  5. 08 Dec, 2025 1 commit
  6. 04 Dec, 2025 1 commit
  7. 19 Nov, 2025 3 commits
  8. 18 Nov, 2025 1 commit
  9. 11 Nov, 2025 1 commit
    • llm: Use Ollama engine memory layouts for both old and new engines · f560bd07
      Jesse Gross authored
      Currently for both the old and new engines, there is code to
      calculate how much memory is required for a model and lay out
      the layers onto GPUs. This reuses the new engine's lay out code
      for the old engine as well, bringing them closer together. The
      old engine continues to use its current method of estimating
      required memory.
      
      This reduces maintenance effort and improves consistency, as new
      features only need to be implemented in one place. The newer code
      is also more accurate, especially with multiple GPUs.
  10. 30 Oct, 2025 1 commit
  11. 29 Oct, 2025 1 commit
  12. 20 Oct, 2025 1 commit
  13. 16 Oct, 2025 1 commit
  14. 15 Oct, 2025 1 commit
  15. 13 Oct, 2025 1 commit
  16. 10 Oct, 2025 1 commit
  17. 03 Oct, 2025 2 commits
  18. 24 Sep, 2025 1 commit
  19. 17 Sep, 2025 1 commit
  20. 10 Sep, 2025 2 commits
    • ggml: Disable flash attention for gemma2 · 29ddfc2c
      Jesse Gross authored
      Our new engine implementation of gemma2 doesn't support flash
      attention, which means that it also doesn't support KV cache
      quantization. Currently, it is possible to turn these two on,
      which will result in a crash.
    • llm: Remove unneeded warning with flash attention enabled · 71cb86af
      Jesse Gross authored
      If flash attention is enabled without KV cache quantization, we will
      currently always get this warning:
      level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
  21. 08 Sep, 2025 1 commit
    • Hybrid and recurrent memory estimates (#12186) · 7b91c9ce
      Gabe Goodhart authored
      
      
      This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models which are currently being badly overestimated because the default logic assumes full attention for all layers.
      
      The logic for the sizing of the recurrent layers comes from the llama.cpp implementation
      
              ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
              ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);
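Following the quoted llama.cpp lines, a recurrent layer's cache scales with fixed per-sequence state sizes rather than the context length, which is why the full-attention default overestimates. A rough comparison, with hypothetical helper names and illustrative numbers:

```go
package main

import "fmt"

// recurrentLayerBytes mirrors the two llama.cpp tensors quoted above:
// r and s states sized by n_embd_r / n_embd_s times mem_size (number of
// cached sequence states). elemSize stands in for the byte width of
// type_r / type_s. Names here are illustrative, not Ollama's code.
func recurrentLayerBytes(nEmbdR, nEmbdS, memSize, elemSize int) int {
	r := nEmbdR * memSize * elemSize
	s := nEmbdS * memSize * elemSize
	return r + s
}

// attentionLayerBytes is the full-attention equivalent for comparison:
// K and V, each nCtx x nEmbdKV elements.
func attentionLayerBytes(nCtx, nEmbdKV, elemSize int) int {
	return 2 * nCtx * nEmbdKV * elemSize
}

func main() {
	// A recurrent layer's cache is independent of context length...
	fmt.Println(recurrentLayerBytes(1024, 4096, 1, 2))
	// ...while a full-attention layer grows with it, hence the overestimate.
	fmt.Println(attentionLayerBytes(8192, 1024, 2))
}
```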
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
  22. 26 Aug, 2025 3 commits
  23. 15 Aug, 2025 1 commit
  24. 14 Aug, 2025 2 commits
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
    • update vendored llama.cpp and ggml (#11823) · 1a19df1f
      Michael Yang authored
      * TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch
      
      This will be redone once my branch is merged upstream in llama.cpp
      
      * feat: Update all patches
      
      There are a number that are no longer needed at all:
      
      - 0003-embeddings: Embeddings entirely overhauled on master
      - 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
          overhauled on master
      - 0019-metal-add-mean-kernel-14267: Merged upstream
      - 0020-CUDA-add-mean-operation-14313: Merged upstream
      
      * feat: Sync llama.cpp and ggml
      
      * fix: Update rsync-filter for all moved/new/removed files
      
      * fix: Add files missing from sync
      
      * fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs
      
      * fix: Add ggml files missing from sync
      
      * fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files
      
      * fix: Remove mtmd main cpp files
      
      * fix: Add missing include in sampling_ext.cpp
      
      * fix: Update llama.go to use mtmd instead of clip/llava
      
      * fix: Add patch for mtmd_input_text
      
      * chore: Ignore *.patched in the patch directory
      
      * fix: Fix support for arch-specific ggml-cpu source files with new arrangement
      
      In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
      implementations were split out into a nested tree structure under
      ggml-cpu/arch. This conflicts with standard CGO layout where all
      arch-specific source files are expected to live in the same directory as
      the parent go module and use suffixes based on GOOS and GOARCH. As such,
      there were really two options for getting this to work:
      
      1. Add a patch on top of the GGML sync to rearrange the files to match the
      Go layout convention
      2. Use CGO directives to conditionally include the nested source files in
      the compilation units
      
      This commit does (2) in order to minimize the set of changes needed on top
      of the upstream file layout. To get this to work, there are two key things
      needed:
      
      1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
      the preprocessor directives
      2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
      explicitly include the .c|.cpp files for the given architecture from the
      nested directory
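A minimal sketch of what option (2) could look like; the file names, defines, and include paths below are illustrative, not the actual patch:

```go
// cpu.go (sketch): cgo directives explicitly define __${GOARCH}__ so a
// single C compilation unit can select the nested arch-specific sources.
package cpu

// #cgo arm64 CFLAGS: -D__arm64__
// #cgo amd64 CFLAGS: -D__amd64__
import "C"

// The companion arch-impls.c would then chain includes (C code, shown
// here as a comment):
//
//	#if defined(__arm64__)
//	#include "arch/arm/impl.c"
//	#elif defined(__amd64__)
//	#include "arch/x86/impl.c"
//	#endif
```

This keeps the upstream ggml-cpu/arch tree untouched while still compiling only the sources for the active GOARCH.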
      
      * fix: Use mtmd_helper to correctly load the bitmap for the image
      
      * fix: Apply patch for mtmd_text_input
      
      * fix: Add missing stb to llama.cpp rsync-filter
      
      * fix: Add sync'ed stb vendored header
      
      * fix: Use c++17 and include vendor for go wrapper modules
      
      * fix: Update patch 0015 for upstream implementation of uuid
      
      * feat: Bump to the latest tip of the branch
      
      * fix: Update patches for bump
      
      * feat: Bump back to the central repo and point at the latest master
      
      This includes granite 4 and a number of other model architectures!
      
      * fix: Revert changes to ggml export GPU UUID patch
      
      * fix: Add patch for GGML_VERSION and GGML_COMMIT constants
      
      * feat: Sync all patched code
      
      * build: Include cmake/common.cmake in ggml sync
      
      * build: Add top-level include for GNUInstallDirs in CMakeLists.txt
      
      This is used to populate CMAKE_INSTALL_BINDIR
      
      * fix: Add a patch to avoid power throttling API on non-msvc windows builds
      
      * fix: Sync patch changes for ggml-cpu.c
      
      * feat: Bump llama.cpp to 4a4f42
      
      This picks up support for Kimi K2 and PLaMO-2
      
      * feat: Sync llama.cpp
      
      * fix: Handle multi-chunk image encodings from mtmd
      
      * fix: Re-number patches after merge with `main`
      
      * feat: Bump to 41e78c in the makefile
      
      * fix: Fix Solar and argsort/copy patches after bump
      
      * fix: Remove Gemma3n CUDA Graphs patch
      
      It was implemented upstream:
      https://github.com/ggml-org/llama.cpp/pull/14741
      
      * feat: Sync llama.cpp / ggml after latest bump
      
      * build: Remove unnecessary CFLAGS definitions in cpu.go
      
      * fix: Remove unnecessary additions in the rsync-filter
      
      * fix: Remove unused vendored code for chat template parsing
      
      * Revert "fix: Remove Gemma3n CUDA Graphs patch"
      
      This reverts commit d724caced3ce21f08924d4b7801f94ce6638f6ea.
      
      * fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes
      
      https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394
      
      
      
      * fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n
      
      * unwind mxfp4 patch
      
      Prepare to bump ggml with their impl for mxfp4
      
      * bump
      
      * fix windows build error
      
      * Convert tensors at load time
      
      Repack the mxfp4 tensors as ggml's kernels expect them to be.
      
      * convert mlp bf16 to f32
      
      * buffer the conversion better
      
      * reshape earlier
      
      * openai swiglu
      
      * add ids
      
      * split qkv, gate_up
      
      * fix nested alt tags
      
      * fast attention
      
      * remove debug messages
      
      * fix lint
      
      * remove redundant test
      
      * remap values only if source/target are different
      
      * add back i32->i32 copy
      
      * refactor cpu quants
      
      * clean up vendor
      
      * update patch instructions
      
      * clean up patches
      
      * remove webgpu
      
      * update mem
      
      * also handle gpt-oss
      
      * revert convert changes
      
      ---------
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
  25. 05 Aug, 2025 3 commits
    • gptoss: fix memory calc (#11700) · fcec04bf
      Michael Yang authored
    • ggml: Prevent kv cache quantization on gpt-oss · 8253ad4d
      Jesse Gross authored
      KV cache quantization has a dependency on the flash attention kernel.
      We currently cannot use flash attention with gpt-oss as it requires
      additional operations.
      
      The model definition does not call flash attention, so it works
      regardless of the setting but the cache will pick up the
      quantization type. This updates the flash attention setting earlier
      in the loading flow so that all downstream settings are also set correctly.
      
      Fixes: #11671
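The ordering fix described above amounts to deciding flash attention first and only then deriving the KV cache type, so a quantized cache is never carried downstream without it. A hedged sketch with illustrative names:

```go
package main

import "fmt"

// effectiveKVCacheType is a hypothetical sketch: if flash attention is
// unavailable for the model, any requested quantized cache type falls
// back to f16 before downstream settings are derived from it.
func effectiveKVCacheType(flashAttnEnabled bool, requested string) string {
	if !flashAttnEnabled && requested != "f16" {
		return "f16" // quantized KV cache depends on the flash attention kernel
	}
	return requested
}

func main() {
	fmt.Println(effectiveKVCacheType(false, "q8_0")) // f16
}
```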
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
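As a rough illustration of the MX FP4 layout (per the OCP Microscaling spec: blocks of 32 four-bit E2M1 elements sharing one E8M0 power-of-two scale byte), here is a hedged decoder sketch. This is not the code from this commit, just the format's arithmetic:

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 magnitudes for the 3 value bits of an FP4 E2M1 element.
var e2m1 = [8]float64{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeMXFP4 decodes one nibble per element against a shared E8M0
// scale (a biased power-of-two exponent with no mantissa). The top bit
// of each nibble is the sign. Illustrative, not the commit's kernels.
func decodeMXFP4(scale byte, nibbles []byte) []float64 {
	s := math.Pow(2, float64(int(scale))-127)
	out := make([]float64, len(nibbles))
	for i, n := range nibbles {
		v := e2m1[n&0x7]
		if n&0x8 != 0 {
			v = -v
		}
		out[i] = v * s
	}
	return out
}

func main() {
	fmt.Println(decodeMXFP4(127, []byte{0x1, 0x9, 0x7})) // [0.5 -0.5 6]
}
```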
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if detected
      on the system)
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up; however, bf16 is
      only supported on v14+, so we were falling back to ggml-blas and
      crashing on bf16 tensors. Checking for the function being null
      seems to be the simplest way to conditionally avoid registering the
      backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms
      but lower values will be silently reset.
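The silent reset described above amounts to a simple clamp; only the 8192 minimum comes from the commit message, the helper name is hypothetical:

```go
package main

import "fmt"

// clampGptossContext sketches the silent reset: values below the
// model's minimum context length are raised to it, higher values pass.
func clampGptossContext(requested int) int {
	const minCtx = 8192 // minimum for gptoss per the commit message
	if requested < minCtx {
		return minCtx
	}
	return requested
}

func main() {
	fmt.Println(clampGptossContext(2048)) // 8192
}
```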
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
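In other words, a sliding-window layer must scale its fixed window by numParallel itself, since it does not inherit the multiplier already baked into the context size. Schematically, with a hypothetical helper:

```go
package main

import "fmt"

// effectiveWindow sketches the fix: the graph estimate for a
// sliding-window layer uses window * numParallel, mirroring how the
// full context size is already multiplied by numParallel upstream.
func effectiveWindow(slidingWindow, numParallel int) int {
	return slidingWindow * numParallel
}

func main() {
	fmt.Println(effectiveWindow(128, 4)) // 512
}
```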
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
  26. 26 Jun, 2025 3 commits
  27. 20 Jun, 2025 1 commit
  28. 18 Jun, 2025 1 commit
  29. 16 Jun, 2025 1 commit