1. 05 Aug, 2025 1 commit
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
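
      As a minimal sketch of the assumed layout, one OCP MXFP4 block packs 32
      FP4 (E2M1) elements behind a single shared E8M0 scale byte (17 bytes per
      block). The Go below is illustrative only; the type and function names
      are not the actual ggml/ollama symbols, and the nibble packing order is
      an assumption.

        package mxfp4

        import "math"

        // e2m1 maps a 3-bit FP4 magnitude code to its value (sign handled separately).
        var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

        // block mirrors the assumed OCP MXFP4 layout: one shared E8M0 scale byte
        // plus 32 packed 4-bit elements, two per byte.
        type block struct {
            scale uint8     // E8M0 exponent with bias 127
            data  [16]uint8 // 32 x FP4 (E2M1), low nibble assumed first
        }

        // dequantize expands one block into 32 float32 values.
        func dequantize(b block) [32]float32 {
            scale := float32(math.Pow(2, float64(b.scale)-127))
            var out [32]float32
            for i, by := range b.data {
                for j, nib := range [2]uint8{by & 0x0F, by >> 4} {
                    v := e2m1[nib&0x7]
                    if nib&0x8 != 0 {
                        v = -v
                    }
                    out[2*i+j] = v * scale
                }
            }
            return out
        }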
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if one
      is detected on the system).
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
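
      Eight 4-bit elements fit in each 32-bit load, so the wider read cuts the
      number of memory transactions. A rough Go illustration of the unpacking
      (the real kernel is CUDA; the function name and nibble order are
      assumptions):

        // unpack8 splits one 32-bit load into eight 4-bit element codes,
        // lowest nibble first (packing order assumed).
        func unpack8(word uint32) [8]uint8 {
            var codes [8]uint8
            for i := range codes {
                codes[i] = uint8(word>>(4*uint(i))) & 0x0F
            }
            return codes
        }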
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up; however, bf16 is
      only supported on v14+, so we were falling back to ggml-blas and
      crashing on bf16 tensors. Checking whether the function is null
      seems to be the simplest way to conditionally avoid registering the
      backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms,
      but lower values will be silently reset to this minimum.
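
      The clamp might look roughly like the sketch below; the constant and
      function names are made up for illustration, not the server's actual
      identifiers.

        // gptossMinContext is the smallest context length gpt-oss works well with.
        const gptossMinContext = 8192

        // effectiveNumCtx silently raises too-small requests to the minimum;
        // anything at or above the minimum passes through unchanged.
        func effectiveNumCtx(requested int) int {
            if requested < gptossMinContext {
                return gptossMinContext
            }
            return requested
        }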
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel, so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
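
      In other words, the sliding-window path needs the multiply that the
      full-context path already gets upstream. A hedged sketch with
      illustrative names:

        // estimateCacheTokens returns the token count used in the graph size
        // estimate. numCtx is assumed to already include the numParallel factor.
        func estimateCacheTokens(numCtx, slidingWindow, numParallel int) int {
            if slidingWindow > 0 && slidingWindow < numCtx {
                // Sliding-window models cap each sequence at a fixed window,
                // so the parallel factor must be applied by hand.
                return slidingWindow * numParallel
            }
            return numCtx // already multiplied by numParallel upstream
        }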
      
      * gpt-oss integration
      
      includes the harmony parser, thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
  2. 23 Jul, 2025 1 commit
  3. 22 Jul, 2025 1 commit
  4. 08 Jul, 2025 2 commits
    • Reduce default parallelism to 1 (#11330) · 20c3266e
      Daniel Hiltgen authored
      The current scheduler algorithm of picking parallelism based on available
      VRAM complicates the upcoming dynamic layer memory allocation algorithm. This
      changes the default to 1, with the intent going forward that parallelism is
      explicit and will no longer be dynamically determined. Removal of the dynamic
      logic will come in a follow-up.
    • API/CLI context enhancements (#11331) · 34088dbc
      Daniel Hiltgen authored
      * API: expose context size of loaded models
      
      * CLI: add context UX
      
      This adds a column to the ps output to show the model's context size.
  5. 27 Jun, 2025 1 commit
  6. 20 Jun, 2025 1 commit
  7. 18 Jun, 2025 2 commits
  8. 12 Jun, 2025 2 commits
  9. 07 Jun, 2025 1 commit
  10. 06 Jun, 2025 1 commit
  11. 05 Jun, 2025 1 commit
  12. 04 Jun, 2025 1 commit
  13. 29 May, 2025 1 commit
    • add thinking support to the api and cli (#10584) · 5f57b0ef
      Devon Rifkin authored
      - Both `/api/generate` and `/api/chat` now accept a `"think"`
        option that allows specifying whether thinking mode should be on or
        not (see the request sketch after this list)
      - Templates get passed this new option so, e.g., qwen3's template can
        put `/think` or `/no_think` in the system prompt depending on the
        value of the setting
      - Models' thinking support is inferred by inspecting model templates.
        The prefix and suffix the parser uses to identify thinking content are
        also automatically inferred from templates
      - Thinking control & parsing is opt-in via the API to prevent breaking
        existing API consumers. If the `"think"` option is not specified, the
        behavior is unchanged from previous versions of ollama
      - Add parsing for thinking blocks in both streaming/non-streaming mode
        in both `/generate` and `/chat`
      - Update the CLI to make use of these changes. Users can pass `--think`
        or `--think=false` to control thinking, or during an interactive
        session they can use the commands `/se...
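
      As a concrete illustration of the request shape, the sketch below sends a
      single /api/chat call with thinking enabled. Only the `"think"` field is
      the new part described above; the rest is the usual chat payload, and the
      model name is just an example.

        package main

        import (
            "bytes"
            "fmt"
            "io"
            "net/http"
        )

        func main() {
            // "think": true opts in to thinking; leaving it out keeps the old behavior.
            body := []byte(`{
                "model": "qwen3",
                "messages": [{"role": "user", "content": "Why is the sky blue?"}],
                "think": true,
                "stream": false
            }`)
            resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
            if err != nil {
                panic(err)
            }
            defer resp.Body.Close()
            out, _ := io.ReadAll(resp.Body)
            fmt.Println(string(out))
        }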
  14. 27 May, 2025 1 commit
  15. 24 May, 2025 1 commit
  16. 23 May, 2025 1 commit
  17. 22 May, 2025 2 commits
    • sched: fix runner leak during reloading unload (#10819) · d950ff12
      Daniel Hiltgen authored
      When the same model is being reloaded rapidly with client connections
      being canceled before the model finishes loading, the queued unload
      event could cause a leak of runners by deleting a different runner from
      the loaded list.
    • server: improve tensor quantization fallback logic (#10806) · fbe6ae28
      Bruce MacDonald authored
      Fall back to alternative quantization types when a tensor's dimensions aren't
      divisible by the block size required by the originally requested quantization
      type. If the retried quantization types also fail, the system ultimately falls
      back to F16 (half-precision floating point), which has a block size of 1 and
      can handle any tensor dimension.
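
      A sketch of that kind of fallback chain, using illustrative block sizes
      (e.g. 256 for K-quants, 32 for Q4_0/Q8_0, 1 for F16); the actual
      candidate order and types live in the server code.

        // blockSize lists elements-per-block for a few illustrative tensor types.
        var blockSize = map[string]int{
            "Q4_K": 256,
            "Q4_0": 32,
            "Q8_0": 32,
            "F16":  1, // divides anything, so it is the final fallback
        }

        // pickQuantType returns the first candidate whose block size divides the
        // tensor's innermost dimension, falling back to F16 otherwise.
        func pickQuantType(dim int, candidates []string) string {
            for _, t := range candidates {
                if bs, ok := blockSize[t]; ok && dim%bs == 0 {
                    return t
                }
            }
            return "F16"
        }

      For example, pickQuantType(736, []string{"Q4_K", "Q4_0"}) would skip Q4_K
      (736 is not a multiple of 256) and settle on Q4_0, while a dimension like
      1000 would fall all the way through to F16.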
  18. 21 May, 2025 1 commit
  19. 19 May, 2025 2 commits
  20. 14 May, 2025 2 commits
  21. 13 May, 2025 1 commit
  22. 12 May, 2025 3 commits
  23. 08 May, 2025 2 commits
  24. 07 May, 2025 2 commits
    • sched: fix race leading to orphaned runners (#10599) · 5e380c3b
      Daniel Hiltgen authored
      If a model is loading, and the request context is canceled during the load
      by a client closing the connection, and another request is inbound for the
      same model with a different configuration (context size, etc.) thus requiring
      a reload, two unload events can be in flight.  The first shuts down the
      original model load, but the second one causes the loss of the new
      reloading runner reference, thus triggering the leak.
      
      The primary fix is detecting the duplicate unload and ignoring the second
      instance.  The load routine is also hardened to ensure we detect
      clobbering an already present runner and unload it with a warning.
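
      In spirit, the duplicate-unload guard might look like the sketch below:
      an unload event carries the runner it was issued for, and a stale event
      that no longer matches the loaded entry is ignored instead of deleting
      the replacement. Names are illustrative; the real scheduler tracks more
      state.

        package sched

        import (
            "log/slog"
            "sync"
        )

        // runner stands in for a loaded model runner.
        type runner struct{}

        func (r *runner) stop() {}

        // scheduler keeps the set of loaded runners keyed by model name.
        type scheduler struct {
            mu     sync.Mutex
            loaded map[string]*runner
        }

        // unload removes a runner, but only if it is still the one this unload
        // event was issued for; otherwise the event is stale and ignored.
        func (s *scheduler) unload(model string, target *runner) {
            s.mu.Lock()
            defer s.mu.Unlock()
            if current, ok := s.loaded[model]; !ok || current != target {
                slog.Warn("ignoring stale unload", "model", model)
                return
            }
            delete(s.loaded, model)
            target.stop()
        }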
    • 392de840
      Jeffrey Morgan authored
  25. 06 May, 2025 3 commits
  26. 05 May, 2025 1 commit
  27. 03 May, 2025 1 commit
    • sched: logging improvements (#10550) · 76ea735a
      Daniel Hiltgen authored
      This enhances our logging in the scheduler. The initial "waiting for server" log
      no longer claims an initial error state (it now reports "not responding", which
      better reflects the actual state). Runners now have slog wiring to report more
      details about the runner, including PID.
  28. 01 May, 2025 1 commit