Commits · fa7776fd2458fc3a8aeb7f12e4bc65b439955319 · OpenDAS / ollama

05 Aug, 2025 1 commit

Michael Yang authored Aug 05, 2025



* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.
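
To make the format concrete, here is a minimal Go sketch of dequantizing one MXFP4 block as the OCP MX specification describes it: 32 four-bit E2M1 elements sharing one E8M0 scale byte. The names `dequantBlock` and `decode` are illustrative, and the nibble order is an assumption; the actual ggml layout may pack elements differently.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 lists the eight magnitudes a 4-bit E2M1 element can represent
// (1 sign bit, 2 exponent bits, 1 mantissa bit).
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// blockSize is the number of elements that share one scale in MXFP4.
const blockSize = 32

// decode maps a 4-bit code to its signed E2M1 value.
func decode(code byte) float32 {
	v := e2m1[code&0x07]
	if code&0x08 != 0 {
		return -v
	}
	return v
}

// dequantBlock expands one block: a shared E8M0 scale byte plus 16 bytes
// holding 32 packed 4-bit elements.
// ASSUMPTION: element 2i is the low nibble of byte i and element 2i+1 the
// high nibble; this ordering is chosen for illustration only.
func dequantBlock(scale byte, packed [blockSize / 2]byte) [blockSize]float32 {
	// E8M0 is a bare biased exponent: value = 2^(scale-127); 0xFF encodes NaN.
	s := float32(math.NaN())
	if scale != 0xFF {
		s = float32(math.Ldexp(1, int(scale)-127))
	}
	var out [blockSize]float32
	for i, b := range packed {
		out[2*i] = decode(b&0x0F) * s
		out[2*i+1] = decode(b>>4) * s
	}
	return out
}

func main() {
	var packed [blockSize / 2]byte
	packed[0] = 0xF3                 // low nibble 0x3 -> 1.5, high nibble 0xF -> -6
	out := dequantBlock(128, packed) // scale = 2^(128-127) = 2
	fmt.Println(out[0], out[1])      // 3 -12
}
```

Each block costs 17 bytes for 32 values, which is where the format's roughly 4.25 bits per weight comes from.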

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up, but bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors.  Checking whether the function is null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.
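
The behaviour described here amounts to a floor on the requested context length. A minimal sketch, assuming a hypothetical helper `effectiveContext` and the architecture key `"gptoss"` (neither is taken from the actual server code):

```go
package main

import "fmt"

// gptossMinContext is the floor named in the commit message.
const gptossMinContext = 8192

// effectiveContext is a hypothetical helper: it raises too-small requests
// for the assumed architecture key "gptoss" and passes everything else
// through unchanged.
func effectiveContext(arch string, requested int) int {
	if arch == "gptoss" && requested < gptossMinContext {
		return gptossMinContext
	}
	return requested
}

func main() {
	fmt.Println(effectiveContext("gptoss", 4096))  // 8192
	fmt.Println(effectiveContext("gptoss", 16384)) // 16384
	fmt.Println(effectiveContext("llama", 4096))   // 4096
}
```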

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
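
A rough sketch of the adjustment, assuming a hypothetical helper `cacheTokens` and an illustrative window of 128 tokens. The cap at the full context is also an assumption made for the example, not something the commit states:

```go
package main

import "fmt"

// cacheTokens returns how many token slots one layer's cache is sized for.
// context is assumed to already equal numCtx * numParallel, as the commit
// message describes; window is the per-sequence sliding window length.
func cacheTokens(context, window, numParallel uint64, sliding bool) uint64 {
	if !sliding {
		return context
	}
	// The window is fixed per sequence, so it has to be scaled by hand.
	scaled := window * numParallel
	if scaled > context {
		return context // ASSUMPTION: a sliding layer never exceeds the full context
	}
	return scaled
}

func main() {
	const numCtx, numParallel, window = 8192, 4, 128
	context := uint64(numCtx * numParallel)
	fmt.Println(cacheTokens(context, window, numParallel, false)) // 32768
	fmt.Println(cacheTokens(context, window, numParallel, true))  // 512
}
```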

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>

fa7776fd

26 Jun, 2025 3 commits
- fs/ggml: add multiplier in graph estimates (#11208) · ba049026
  Jeffrey Morgan authored Jun 26, 2025
  
  ba049026
- fs/ggml: add missing architecture to OllamaEngineRequired() (#11206) · 3944602f
  Jeffrey Morgan authored Jun 26, 2025
  
  3944602f
- add new gemma model (#11204) · 73b642e6
  Michael Yang authored Jun 25, 2025
```
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
```
  73b642e6
20 Jun, 2025 1 commit
- Reapply "feat: incremental gguf parser (#10822)" (#11114) (#11119) · 0a066cfd
  Michael Yang authored Jun 20, 2025
```
* Reapply "feat: incremental gguf parser (#10822)" (#11114)

This reverts commit a6e64fbd.

* fix older ggufs
```
  0a066cfd
18 Jun, 2025 1 commit
- Revert "feat: incremental gguf parser (#10822)" (#11114) · a6e64fbd
  Jeffrey Morgan authored Jun 18, 2025
```
This reverts commit 6b04cad7.
```
  a6e64fbd
16 Jun, 2025 1 commit
- gguf: fix write order (#11068) · a6fbfc88
  Michael Yang authored Jun 16, 2025
```
* ggml: test write gguf order
* ggml: fix write tensor order
```
  a6fbfc88
12 Jun, 2025 1 commit

feat: incremental gguf parser (#10822) · 6b04cad7

Michael Yang authored Jun 12, 2025



* incremental gguf parser
* gguf: update test to not rely on gguf on disc
* re-use existing create gguf
* read capabilities from gguf kv
* kv exists
* update tests
* s/doneFunc/successFunc/g
* new buffered reader

---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

6b04cad7

19 May, 2025 1 commit

ggml: Separate tensor load from backend creation · 94ab428e

Jesse Gross authored Apr 17, 2025

Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.

94ab428e
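
The two-step shape described in this commit can be sketched as follows. `Backend`, `NewBackend`, and `Load` are hypothetical stand-ins written for this example, not the actual ollama types:

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for a model backend that owns one
// buffer per tensor.
type Backend struct {
	buffers map[string][]byte
}

// NewBackend is step one: enumerate tensors and allocate memory for them
// without reading any tensor data.
func NewBackend(sizes map[string]int) *Backend {
	b := &Backend{buffers: make(map[string][]byte, len(sizes))}
	for name, n := range sizes {
		b.buffers[name] = make([]byte, n)
	}
	return b
}

// Load is step two: fill every allocated buffer using the supplied reader.
func (b *Backend) Load(read func(name string, dst []byte) error) error {
	for name, buf := range b.buffers {
		if err := read(name, buf); err != nil {
			return fmt.Errorf("load %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	b := NewBackend(map[string]int{"token_embd.weight": 4, "output.weight": 2})
	// Between the two steps a caller can inspect allocations or defer the
	// slow read until it is actually needed.
	err := b.Load(func(name string, dst []byte) error {
		for i := range dst {
			dst[i] = 1
		}
		return nil
	})
	fmt.Println(err, len(b.buffers)) // <nil> 2
}
```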

14 May, 2025 3 commits
- ggml: update qwen25vl vision size estimate (#10711) · bd68d3ae
  Bruce MacDonald authored May 14, 2025
  
  bd68d3ae
- model: add Qwen2.5-VL support (#10385) · 0aa8b371
  Bruce MacDonald authored May 13, 2025
  
  0aa8b371
- chore: update mllama to use ollama engine (#10637) · 23125648
  Michael Yang authored May 13, 2025
  
  23125648
12 May, 2025 1 commit

Follow up to #10363 (#10647) · 9d6df908

Daniel Hiltgen authored May 12, 2025

The quantization PR didn't block all unsupported file types,
which this PR fixes.  It also updates the API docs to reflect
the now reduced set of supported types.

9d6df908

07 May, 2025 1 commit
- fix data race in WriteGGUF (#10598) · af31ccef
  Daniel Hiltgen authored May 06, 2025
```
err in the goroutine should not be shared with the outer scope
```
  af31ccef
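
A minimal sketch of the pattern this fix points at, using only the standard library. `writeAll` and `errWrite` are hypothetical names, and the actual WriteGGUF code may be structured differently. The point is that each goroutine declares its own `err` rather than assigning to one captured from the enclosing function:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errWrite is a demo error used by main below.
var errWrite = errors.New("write failed")

// writeAll runs write for every item concurrently and reports any errors.
func writeAll(items []string, write func(string) error) error {
	errs := make([]error, len(items))
	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1)
		go func(i int, item string) {
			defer wg.Done()
			// err is declared inside the goroutine with ":=", so no two
			// goroutines ever write the same variable.
			err := write(item)
			errs[i] = err // each goroutine owns exactly one slice element
		}(i, item)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when every element is nil
}

func main() {
	err := writeAll([]string{"a", "b", "c"}, func(s string) error {
		if s == "b" {
			return errWrite
		}
		return nil
	})
	fmt.Println(err) // write failed
}
```

Writing `err = write(item)` against a variable declared outside the goroutine would compile just as well, which is why this kind of race is easy to miss without `go test -race`.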
06 May, 2025 1 commit

Move quantization to new backend (#10363) · 42481045

Daniel Hiltgen authored May 06, 2025

* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.

42481045

05 May, 2025 1 commit
- ggml: Reduce log level of "key not found" · 70736007
  Jesse Gross authored May 05, 2025
```
Most of the time this is not an error.
```
  70736007
01 May, 2025 1 commit

fix: write gguf padding (#10510) · a7835c67

Michael Yang authored Apr 30, 2025

* add gguf_test

* fix padding

padding was being added to offset but not to the running count

a7835c67
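
One way to avoid this class of bug is to derive both the tensor offsets and the total from a single counter that includes padding. The sketch below assumes hypothetical helpers `padding` and `writeTensors` and a 32-byte alignment chosen for the example; it is not the actual gguf writer:

```go
package main

import (
	"fmt"
	"io"
)

// countingWriter discards data and records how many bytes it received.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// padding returns how many zero bytes move offset to the next multiple of align.
func padding(offset, align int64) int64 {
	return (align - offset%align) % align
}

// writeTensors writes each payload followed by alignment padding. It returns
// the start offset of every payload and the total number of bytes written.
func writeTensors(w io.Writer, payloads [][]byte, align int64) ([]int64, int64, error) {
	var offsets []int64
	var written int64 // single running count; offsets are read from it
	for _, p := range payloads {
		offsets = append(offsets, written)
		n, err := w.Write(p)
		written += int64(n)
		if err != nil {
			return nil, written, err
		}
		// Padding bytes are counted too, so the next offset stays aligned.
		n, err = w.Write(make([]byte, padding(written, align)))
		written += int64(n)
		if err != nil {
			return nil, written, err
		}
	}
	return offsets, written, nil
}

func main() {
	cw := &countingWriter{}
	offsets, written, err := writeTensors(cw, [][]byte{make([]byte, 5), make([]byte, 3)}, 32)
	fmt.Println(offsets, written, cw.n, err) // [0 32] 64 64 <nil>
}
```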

27 Apr, 2025 1 commit

ggml: fix crash for array head counts · 6ed88985

Devon Rifkin authored Apr 25, 2025

If it's an array, it uses the max value in the array

If array values for head counts become more popular, we can consider a
more invasive change like #10225 to calculate more accurate estimates.

Fixes: #9984

6ed88985
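
The fallback described in this commit can be sketched as below. The function name `maxHeadCount` and the concrete types `uint32` and `[]uint32` are assumptions for illustration; the real metadata values may use other types:

```go
package main

import "fmt"

// maxHeadCount accepts either a scalar head count or a per-layer array and
// returns a single value suitable for a conservative estimate.
// ASSUMPTION: values arrive as uint32 or []uint32.
func maxHeadCount(v any) uint64 {
	switch t := v.(type) {
	case uint32:
		return uint64(t)
	case []uint32:
		var m uint64
		for _, x := range t {
			if uint64(x) > m {
				m = uint64(x)
			}
		}
		return m
	default:
		return 0 // unknown type: report zero rather than panic
	}
}

func main() {
	fmt.Println(maxHeadCount(uint32(8)), maxHeadCount([]uint32{4, 16, 8})) // 8 16
}
```

Taking the maximum overestimates memory for layers with fewer heads, which matches the commit's note that a per-layer calculation would be more accurate.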

25 Apr, 2025 9 commits
- memory · f0ad49ea
  Michael Yang authored Apr 23, 2025
  
  f0ad49ea
- llama4 · f0c66e6d
  Michael Yang authored Apr 03, 2025
  
  f0c66e6d
- fix parameter count · ced7d0e5
  Michael Yang authored Apr 23, 2025
  
  ced7d0e5
- default slice values · a0dba0f8
  Michael Yang authored Apr 23, 2025
  
  a0dba0f8
- update comment · 5e20b170
  Michael Yang authored Apr 23, 2025
  
  5e20b170
- fix token type · d26c18e2
  Michael Yang authored Apr 23, 2025
  
  d26c18e2
- zero means zero · 8d376acc
  Michael Yang authored Apr 23, 2025
```
using a default of 1024 when asking for zero is confusing since most calls
seem to assume 0 means do not read any data
```
  8d376acc
- generic ggml.array · 5d027916
  Michael Yang authored Apr 23, 2025
  
  5d027916
- convert: change to colmajor · 4892872c
  Michael Yang authored Apr 25, 2025
  
  4892872c
16 Apr, 2025 1 commit
- fix write gguf padding · 2fec73ee
  Michael Yang authored Apr 11, 2025
  
  2fec73ee
03 Apr, 2025 2 commits

model: support for mistral-small in the ollama runner · 6bd0a983

Bruce MacDonald authored Mar 14, 2025

Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.

6bd0a983

fs: move ml.Config to fs package · 3b96a936
Michael Yang authored Mar 18, 2025

3b96a936

26 Mar, 2025 1 commit

ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3

Jesse Gross authored Mar 24, 2025

Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at the moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this into account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890

f66216e3
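
To illustrate why a per-layer figure matters, here is a small sketch with made-up numbers. `kvCacheBytes` is a hypothetical helper, and the rule that every sixth layer is global is an assumption used only to mimic the 5/6 ratio mentioned above:

```go
package main

import "fmt"

// kvCacheBytes returns the KV cache size of every layer. bytesPerToken
// covers both K and V for one token in one layer.
func kvCacheBytes(layers int, context, window, bytesPerToken uint64, sliding func(int) bool) []uint64 {
	sizes := make([]uint64, layers)
	for i := range sizes {
		tokens := context
		if sliding(i) && window < context {
			tokens = window // sliding layers only keep the window
		}
		sizes[i] = tokens * bytesPerToken
	}
	return sizes
}

func main() {
	// ASSUMPTION: every sixth layer is global, the other five use a window.
	sliding := func(i int) bool { return (i+1)%6 != 0 }
	sizes := kvCacheBytes(6, 8192, 1024, 1024, sliding)

	var total, largest uint64
	for _, s := range sizes {
		total += s
		if s > largest {
			largest = s
		}
	}
	fmt.Println("per layer:", sizes)
	fmt.Println("average:", total/uint64(len(sizes)), "largest:", largest)
}
```

With these numbers the global layer is eight times the size of a sliding layer, so a GPU budgeted by the average would be too small for it, while budgeting every layer at the maximum wastes most of the allocation.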

13 Mar, 2025 6 commits
- count non-repeating vision layers · 8d76fa23
  Michael Yang authored Mar 13, 2025
  
  8d76fa23
- fix divide by zero · 65b88c54
  Michael Yang authored Mar 13, 2025
  
  65b88c54
- roughly count gemma3 graph · a422ba39
  Michael Yang authored Mar 13, 2025
```
the largest operation is by far (q @ k) so just count that for
simplicity
```
  a422ba39
- count all vision tensors · d2ec2237
  Michael Yang authored Mar 12, 2025
  
  d2ec2237
- count gemma3 vision tensors · 033cec23
  Michael Yang authored Mar 12, 2025
  
  033cec23
- add verbose mode to the show command (#9640) · 4bed7392
  Patrick Devine authored Mar 13, 2025
```
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
```
  4bed7392
11 Mar, 2025 2 commits
- llm: auto detect models that require Ollama Engine (#1 ) · ab39e08e
  Daniel Hiltgen authored Mar 11, 2025
  
  ab39e08e
- gemma2 impl · 5f74d1fd
  Patrick Devine authored Feb 07, 2025
  
  5f74d1fd
04 Mar, 2025 1 commit

New engine: vision models and auto-fallback (#9113) · 1fdb351c

Daniel Hiltgen authored Mar 04, 2025

* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine

1fdb351c