1. 17 Oct, 2025 1 commit
    • test: harden scheduler tests (#12662) · 68e04c7f
      Daniel Hiltgen authored
      * test: harden scheduler tests
      
      This removes reschedDelay, which was stale code, and adds a
      configurable timeout for waitForVRAMRecovery so that tests can set
      a very short timeout and avoid the scheduler getting stuck and
      hitting a test timeout.
      
      * test: tune tests for partial loads
      
      Give stress tests more time when the model is split between CPU/GPU
      68e04c7f
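      The pattern this describes, a timeout that keeps its production default but can be
      overridden by tests, might look roughly like the following Go sketch. The type,
      field, and function names are illustrative; only waitForVRAMRecovery and the idea
      of a configurable timeout come from the commit message.

      ```go
      package sched

      import "time"

      // scheduler is a simplified stand-in; only the pieces needed to show a
      // test-configurable timeout are included.
      type scheduler struct {
          // vramRecoveryTimeout bounds how long we wait for GPU memory to be
          // released after a runner exits. Tests can shrink it so a scheduler
          // that never sees VRAM come back fails fast instead of hanging.
          vramRecoveryTimeout time.Duration
      }

      func newScheduler() *scheduler {
          return &scheduler{vramRecoveryTimeout: 5 * time.Second} // production default
      }

      // waitForVRAMRecovery polls freeVRAM until enough memory is available or
      // the configured timeout expires.
      func (s *scheduler) waitForVRAMRecovery(needed uint64, freeVRAM func() uint64) bool {
          deadline := time.Now().Add(s.vramRecoveryTimeout)
          for time.Now().Before(deadline) {
              if freeVRAM() >= needed {
                  return true
              }
              time.Sleep(10 * time.Millisecond)
          }
          return false // give up rather than block the caller forever
      }
      ```

      A test can then construct the scheduler with vramRecoveryTimeout set to a few
      milliseconds, so a case where VRAM never recovers fails quickly instead of
      tripping the overall test timeout.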
  2. 11 Oct, 2025 1 commit
  3. 10 Oct, 2025 1 commit
  4. 09 Oct, 2025 3 commits
  5. 08 Oct, 2025 1 commit
  6. 01 Oct, 2025 1 commit
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner. This should eliminate inconsistency between our GPU discovery and the
      runner's capabilities at runtime, particularly for cases where we try to filter
      out unsupported GPUs. Now the runner does that implicitly based on the actual
      device list. In some cases free VRAM reporting can be unreliable, which can
      lead to scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries when available.
      
      Automatic workarounds have been removed, as only one GPU relied on them; this
      is now documented. That GPU will soon fall off the support matrix with the next
      ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
      bc8909fb
  7. 22 Aug, 2025 1 commit
  8. 14 Aug, 2025 1 commit
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
      d5a0d8d9
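      Opt-in switches like this are typically read from the environment when the model
      is scheduled; a minimal sketch of such a gate is below. Only the
      OLLAMA_NEW_ESTIMATES variable comes from the commit message; the helper name is
      hypothetical.

      ```go
      package llm

      import (
          "os"
          "strconv"
      )

      // newEstimatesEnabled reports whether the user opted in to the new
      // allocation-tracking memory management via OLLAMA_NEW_ESTIMATES=1.
      // Unset or unparsable values fall back to the existing estimate path.
      func newEstimatesEnabled() bool {
          v := os.Getenv("OLLAMA_NEW_ESTIMATES")
          if v == "" {
              return false
          }
          enabled, err := strconv.ParseBool(v)
          return err == nil && enabled
      }
      ```

      In practice that means starting the server with OLLAMA_NEW_ESTIMATES=1 when the
      model runs on the Ollama engine; everything else keeps using the existing
      estimates.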
  9. 05 Aug, 2025 2 commits
    • tools: support anyOf types · 30f8a68c
      Devon Rifkin authored
      As far as we know, gpt-oss is the first model that meaningfully transforms tool
      function definitions in its template. We found that relatively common
      definitions that include `anyOf` were not working because the template
      assumed that types were always defined via a `type` field.
      
      anyOf allows for fully recursive types, so I exposed a
      `toTypeScriptType()` function to handle this recursive logic in go and
      keep the templates cleaner. The gpt-oss templates will need to be
      updated to use this.
      
      We should keep building out our function definition support to more
      fully support the parts of json schema that make sense for this use
      case, but in the meantime this will unblock some users (e.g., zed's
      ollama integration w/ gpt-oss). Probably the most urgent is proper array
      support
      30f8a68c
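      The core of the change is a recursive translation from a JSON Schema property to a
      TypeScript-style type string, unioning the alternatives listed under anyOf. The
      sketch below is an assumption about its shape; only the toTypeScriptType name and
      the anyOf handling come from the commit message.

      ```go
      package tools

      import "strings"

      // property is a pared-down view of a JSON Schema node: either a plain
      // "type" or a list of alternatives under "anyOf".
      type property struct {
          Type  string
          AnyOf []property
          Items *property // element schema when Type == "array"
      }

      // toTypeScriptType renders a schema node as a TypeScript-like type string,
      // recursing into anyOf so nested unions such as string | number[] | null work.
      func toTypeScriptType(p property) string {
          if len(p.AnyOf) > 0 {
              parts := make([]string, 0, len(p.AnyOf))
              for _, alt := range p.AnyOf {
                  parts = append(parts, toTypeScriptType(alt))
              }
              return strings.Join(parts, " | ")
          }
          switch p.Type {
          case "integer", "number":
              return "number"
          case "array":
              if p.Items != nil {
                  return toTypeScriptType(*p.Items) + "[]"
              }
              return "any[]"
          case "null":
              return "null"
          case "":
              return "any"
          default:
              return p.Type // string, boolean, object, ...
          }
      }
      ```

      Keeping the recursion in Go means the template only ever interpolates a finished
      type string instead of walking the schema itself.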
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if detected
      on the system)
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up; however, bf16 is
      only supported on v14+, so we were falling back to ggml-blas and
      crashing on bf16 tensors. Checking for the function being null
      seems to be the simplest way to conditionally avoid registering the
      backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms
      but lower values will be silently reset.
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
      fa7776fd
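      For context on the MXFP4 tensor type added here: the Open Compute Microscaling
      spec groups 32 FP4 (E2M1) elements into a block that shares one E8M0 scale byte,
      i.e. a power-of-two scale. The decoder sketch below illustrates that layout under
      the assumption of low-then-high nibble packing; it is not the actual backend code.

      ```go
      package mxfp4

      import "math"

      // fp4Values maps the three magnitude bits of an E2M1 element (2 exponent
      // bits, 1 mantissa bit) to its value; the high bit of each nibble is the sign.
      var fp4Values = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

      // decodeBlock expands one MX block: a shared E8M0 scale byte (2^(scale-127))
      // followed by 16 bytes holding 32 packed FP4 nibbles.
      func decodeBlock(scale byte, packed [16]byte) [32]float32 {
          s := float32(math.Exp2(float64(int(scale) - 127)))
          var out [32]float32
          for i, b := range packed {
              out[2*i] = s * fp4ToFloat(b&0x0F)
              out[2*i+1] = s * fp4ToFloat(b>>4)
          }
          return out
      }

      func fp4ToFloat(n byte) float32 {
          v := fp4Values[n&0x07]
          if n&0x08 != 0 { // sign bit set
              return -v
          }
          return v
      }
      ```

      The actual mulmat and mulmatid kernels presumably operate on the packed form
      directly rather than materializing float32 blocks as the sketch does.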
  10. 29 May, 2025 1 commit
    • add thinking support to the api and cli (#10584) · 5f57b0ef
      Devon Rifkin authored
      - Both `/api/generate` and `/api/chat` now accept a `"think"`
        option that allows specifying whether thinking mode should be on or
        not
      - Templates get passed this new option so, e.g., qwen3's template can
        put `/think` or `/no_think` in the system prompt depending on the
        value of the setting
      - Models' thinking support is inferred by inspecting model templates.
        The prefix and suffix the parser uses to identify thinking support are
        also automatically inferred from templates
      - Thinking control & parsing is opt-in via the API to prevent breaking
        existing API consumers. If the `"think"` option is not specified, the
        behavior is unchanged from previous versions of ollama
      - Add parsing for thinking blocks in both streaming/non-streaming mode
        in both `/generate` and `/chat`
      - Update the CLI to make use of these changes. Users can pass `--think`
        or `--think=false` to control thinking, or during an interactive
        session they can use the commands `/set think` or `/set nothink`
      - A `--hidethinking` option has also been added to the CLI. This makes
        it easy to use thinking in scripting scenarios like
        `ollama run qwen3 --think --hidethinking "my question here"` where you
        just want to see the answer but still want the benefits of thinking
        models
      5f57b0ef
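      As a usage illustration of the new option, a minimal non-streaming /api/chat call
      with thinking enabled might look like the Go sketch below; only the "think" option
      itself comes from this change, the rest is the ordinary chat request shape.

      ```go
      package main

      import (
          "bytes"
          "encoding/json"
          "fmt"
          "net/http"
      )

      func main() {
          // Build a chat request with thinking turned on; omitting "think"
          // keeps the pre-existing behavior.
          body, _ := json.Marshal(map[string]any{
              "model": "qwen3",
              "messages": []map[string]string{
                  {"role": "user", "content": "Why is the sky blue?"},
              },
              "think":  true,
              "stream": false,
          })

          resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          var out map[string]any
          if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
              panic(err)
          }
          fmt.Println(out["message"]) // thinking is parsed out separately from the answer
      }
      ```

      On the CLI the same toggle is `ollama run qwen3 --think`, with `--hidethinking`
      to print only the final answer.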
  11. 08 May, 2025 1 commit
  12. 06 May, 2025 1 commit
    • Move quantization to new backend (#10363) · 42481045
      Daniel Hiltgen authored
      * Move quantization logic to GGML via new backend
      
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
      
      * Remove "add model quantizations"
      
      This is no longer needed now that quantization is implemented in Go+GGML code directly.
      42481045
  13. 28 Apr, 2025 1 commit
  14. 22 Apr, 2025 1 commit
    • increase default context length to 4096 (#10364) · 424f6486
      Devon Rifkin authored
      * increase default context length to 4096
      
      We lower the default numParallel from 4 to 2 and use these "savings" to
      double the default context length from 2048 to 4096.
      
      We're memory neutral in cases when we previously would've used
      numParallel == 4, but we add the following mitigation to handle some
      cases where we would have previously fallen back to 1x2048 due to low
      VRAM: we decide between 2048 and 4096 using a runtime check, choosing
      2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
      purposefully don't check the available VRAM because we don't want the
      context window size to change unexpectedly based on the available VRAM.
      
      We plan on making the default even larger, but this is a relatively
      low-risk change we can make to quickly double it.
      
      * fix tests
      
      add an explicit context length so they don't get truncated. The code
      that converts -1 from being a signal for doing a runtime check isn't
      running as part of these tests.
      
      * tweak small gpu message
      
      * clarify context length default
      
      also make it actually show up in `ollama serve --help`
      424f6486
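      The runtime check described above boils down to a small decision: 2 x 4096 costs
      about the same memory as the old 4 x 2048, and only small single-GPU machines keep
      the old 2048 default. A sketch of that logic, with hypothetical names:

      ```go
      package server

      // defaultContextLength picks the default context window when the user has
      // not set one explicitly.
      func defaultContextLength(gpuCount int, totalVRAMBytes uint64) int {
          const fourGB = 4 << 30
          // Use total VRAM, not free VRAM, so the default does not change
          // unexpectedly based on what else happens to be running.
          if gpuCount == 1 && totalVRAMBytes <= fourGB {
              return 2048
          }
          return 4096
      }
      ```

      Combined with dropping numParallel from 4 to 2, the common case stays roughly
      memory neutral.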
  15. 10 Apr, 2025 1 commit
  16. 08 Apr, 2025 1 commit
  17. 07 Apr, 2025 1 commit
  18. 03 Apr, 2025 1 commit
    • llm: set done reason at server level (#9830) · e53b3cbd
      Bruce MacDonald authored
      No functional change. Many different done reasons can be set at the runner
      level, so rather than obscuring them we should return them to the server
      process and let it choose what to do with the done reason. This separates
      the API concerns from the runner.
      e53b3cbd
  19. 14 Feb, 2025 1 commit
    • next ollama runner (#7913) · 58245413
      Michael Yang authored
      
      
      feat: add new Ollama engine using ggml through cgo
      
      This change introduces a new way to run pretrained models. It introduces three high-level interfaces and a number of smaller helper interfaces to facilitate this.
      
      - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
      - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
      - `ml.Tensor` defines the interface for a tensor and tensor operations
      
      This is the first implementation of the new engine. Follow up PRs will implement more features:
      
      - non-greedy sampling (#8410)
      - integration with Ollama and KV caching (#8301)
      - more model support (#9080) with more coming soon
      Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
      58245413
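      Restated as Go, the three interfaces described above might be shaped roughly like
      the sketch below; the signatures are illustrative only, and the real definitions
      live in model/model.go and ml/backend.go as the commit notes.

      ```go
      package ml

      // Tensor is the value the engine computes with; the op set shown here is
      // only a hint of the real interface.
      type Tensor interface {
          Shape() []int
          Mulmat(other Tensor) Tensor
      }

      // Backend loads a pretrained model onto hardware (GPU, CPU, ...) and hands
      // out its weights as Tensors; ggml is the first implementation.
      type Backend interface {
          Get(name string) Tensor // look up a loaded weight by name
      }

      // Model is implemented per architecture (llama, mllama, ...); Forward runs
      // forward propagation and is called to generate completions. In the real
      // tree this lives in the model package rather than ml.
      type Model interface {
          Forward(input Tensor) (Tensor, error)
      }
      ```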
  20. 01 Jan, 2025 1 commit
  21. 11 Dec, 2024 1 commit
  22. 27 Nov, 2024 1 commit
  23. 18 Oct, 2024 1 commit
  24. 17 Oct, 2024 1 commit
  25. 27 Aug, 2024 1 commit
  26. 31 Jul, 2024 2 commits
  27. 20 Jul, 2024 1 commit
  28. 16 Jul, 2024 2 commits